{"id":13789280,"url":"https://github.com/Macaronlin/LLaMA3-Quantization","last_synced_at":"2025-05-12T05:32:12.571Z","repository":{"id":235355311,"uuid":"789743429","full_name":"Macaronlin/LLaMA3-Quantization","owner":"Macaronlin","description":"A repository dedicated to evaluating the performance of quantizied LLaMA3 using various quantization methods..","archived":false,"fork":false,"pushed_at":"2025-01-14T17:56:25.000Z","size":2708,"stargazers_count":175,"open_issues_count":12,"forks_count":8,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-01-14T19:18:44.019Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Macaronlin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-21T12:44:48.000Z","updated_at":"2025-01-14T17:56:29.000Z","dependencies_parsed_at":"2024-04-23T06:28:14.339Z","dependency_job_id":"b40cdaab-6c04-432a-aa72-724b6a330a83","html_url":"https://github.com/Macaronlin/LLaMA3-Quantization","commit_stats":null,"previous_names":["macaronlin/llama3-quantization"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Macaronlin%2FLLaMA3-Quantization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Macaronlin%2FLLaMA3-Quantization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Macaronlin%2FLLaMA3-Quantization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/Macaronlin%2FLLaMA3-Quantization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Macaronlin","download_url":"https://codeload.github.com/Macaronlin/LLaMA3-Quantization/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253682543,"owners_count":21946959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T21:01:01.039Z","updated_at":"2025-05-12T05:32:12.533Z","avatar_url":"https://github.com/Macaronlin.png","language":"Python","readme":"# LLaMA3-Quantization\n\nLLaMA3-Quantization is the official implementation of our paper How Good Are Low-bit Quantized LLAMA3 Models?\nAn Empirical Study [[PDF](https://arxiv.org/abs/2404.14047)]. Created by researchers from The University of Hong Kong, Beihang University, and ETH Zürich.\n\n## Introduction\nMeta's LLaMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, LLaMA3 models have recently been released and achieve impressive performance across various benchmarks, thanks to super-large-scale pre-training on over 15T tokens of data. Given the wide application of low-bit quantization for LLMs in resource-limited scenarios, we explore LLaMA3's capabilities when quantized to low bit-widths. This exploration holds the potential to unveil new insights and challenges for low-bit quantization of LLaMA3 and other forthcoming LLMs, especially in addressing the performance degradation problems that arise in LLM compression.
Specifically, we evaluate 10 existing post-training quantization and LoRA-finetuning methods on LLaMA3 at 1-8 bits and on diverse datasets to comprehensively reveal LLaMA3's low-bit quantization performance. Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in these scenarios, especially at ultra-low bit-widths. This highlights the significant performance gap at low bit-widths that needs to be bridged in future developments. We expect this empirical study to prove valuable in advancing future models, pushing LLMs to lower bit-widths with higher accuracy so that they become practical. Our project is released at [https://github.com/Macaronlin/LLaMA3-Quantization](https://github.com/Macaronlin/LLaMA3-Quantization) and quantized LLaMA3 models are released at [https://huggingface.co/Efficient-ML](https://huggingface.co/Efficient-ML).\n\n![img](images/overview.png)\n\n## Usage\n\nWe provide full scripts to evaluate various quantization methods in `./scripts/`.
We use LLaMA-3-8B with the IR-QLoRA method as an example here:\n\n```shell\npython main.py \\\n    --model meta-llama/Meta-Llama-3-8B \\\n    --peft LLMQ/LLaMA-3-8B-IR-QLoRA \\\n    --tau_range 0.1 --tau_n 100 --blocksize 256 \\\n    --epochs 0 \\\n    --output_dir ./log/llama-3-8b-irqlora \\\n    --wbits 4 \\\n    --tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande\n```\n\n## Results\n\n### Track 1: Post-Training Quantization\n\n- Evaluation results of post-training quantization on the LLAMA3-8B model.\n  ![img](images/result_ptq_1.png)\n\n- Evaluation results of post-training quantization on the LLAMA3-70B model.\n  ![img](images/result_ptq_2.png)\n\n### Track 2: LoRA-FineTuning Quantization\n- LoRA-FT on LLAMA3-8B with the Alpaca dataset.\n  ![img](images/result_lora_ft_1.png)\n\n## Related Projects\n\n[QuIP: Quantization with Incoherence Processing](https://github.com/Cornell-RelaxML/QuIP)\n\n[GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers](https://github.com/IST-DASLab/gptq)\n\n[AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)\n\n[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://github.com/mit-han-lab/llm-awq)\n\n[RPTQ: Reorder-Based Post-Training Quantization for Large Language Models](https://github.com/hahnyuan/RPTQ4LLM)\n\n[OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models](https://github.com/OpenGVLab/OmniQuant)\n\n[PB-LLM: Partially Binarized Large Language Models](https://github.com/hahnyuan/PB-LLM)\n\n[BiLLM: Pushing the Limit of Post-Training Quantization for LLMs](https://github.com/Aaronhuang-778/BiLLM)\n\n[SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://github.com/mit-han-lab/smoothquant)\n\n[QLoRA: Efficient Finetuning of Quantized LLMs](https://github.com/artidoro/qlora)\n\n[IR-QLoRA: Accurate LoRA-Finetuning Quantization of LLMs via Information Retention](https://github.com/htqin/IR-QLoRA)\n\n\n\u003c!-- ## Citation\nIf you use our 
approach in your research, please cite our paper:\n\n```\n\n``` --\u003e\n","funding_links":[],"categories":["Tools"],"sub_categories":["Other"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMacaronlin%2FLLaMA3-Quantization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMacaronlin%2FLLaMA3-Quantization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMacaronlin%2FLLaMA3-Quantization/lists"}