{"id":24438383,"url":"https://github.com/horseee/LLM-Pruner","last_synced_at":"2025-10-01T08:30:28.494Z","repository":{"id":167557967,"uuid":"642028103","full_name":"horseee/LLM-Pruner","owner":"horseee","description":"[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA,  BLOOM, Vicuna, Baichuan, TinyLlama, etc.","archived":false,"fork":false,"pushed_at":"2024-10-07T08:49:26.000Z","size":6209,"stargazers_count":1066,"open_issues_count":64,"forks_count":131,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-09-30T07:04:49.963Z","etag":null,"topics":["baichuan","bloom","chatglm","compression","language-model","llama","llama-2","llama3","llm","neurips-2023","pruning","pruning-algorithms","vicuna"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2305.11627","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/horseee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-17T16:58:28.000Z","updated_at":"2025-09-27T13:01:31.000Z","dependencies_parsed_at":"2023-12-19T09:42:54.334Z","dependency_job_id":"dad46a51-73af-4186-8da5-05ff0ac40c5e","html_url":"https://github.com/horseee/LLM-Pruner","commit_stats":{"total_commits":154,"total_committers":4,"mean_commits":38.5,"dds":"0.39610389610389607","last_synced_commit":"128a07d977f9b205d60ab14cfbc6a78f8a8e39d2"},"previous_names":["horseee/llm-pruner"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/horseee/LLM-Pruner","repository_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/horseee%2FLLM-Pruner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/horseee%2FLLM-Pruner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/horseee%2FLLM-Pruner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/horseee%2FLLM-Pruner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/horseee","download_url":"https://codeload.github.com/horseee/LLM-Pruner/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/horseee%2FLLM-Pruner/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":277777981,"owners_count":25875397,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-30T02:00:09.208Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["baichuan","bloom","chatglm","compression","language-model","llama","llama-2","llama3","llm","neurips-2023","pruning","pruning-algorithms","vicuna"],"created_at":"2025-01-20T19:02:05.182Z","updated_at":"2025-10-01T08:30:28.487Z","avatar_url":"https://github.com/horseee.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","GitHub projects","Python"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/logo.png\" width=\"20%\"\u003e 
\u003cbr\u003e\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003ch1\u003eLLM-Pruner\u003c/h1\u003e\n  \u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://opensource.org/licenses/Apache-2.0\"\u003e\n    \u003cimg alt=\"License: Apache 2.0\" src=\"https://img.shields.io/badge/License-Apache%202.0-4E94CE.svg\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://pytorch.org/\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/PyTorch-%3E=v1.7.1-EE4C2C.svg?style=flat-square\" alt=\"PyTorch\u003e=v1.7.1\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/facebookresearch/llama\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/LLMs-LLaMA-FFB000.svg?style=flat-square\" alt=\"LLaMA\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/facebookresearch/llama\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/LLMs-Llama2-FAB093.svg?style=flat-square\" alt=\"Llama-2\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/facebookresearch/llama\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/LLMs-Llama3\u00263.1-7CC217.svg?style=flat-square\" alt=\"Llama-3\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/lm-sys/FastChat\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/LLMs-Vicuna-924E7D.svg?style=flat-square\" alt=\"Vicuna\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/docs/transformers/model_doc/bloom\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/LLMs-BLOOM-1A63BD.svg?style=flat-square\" alt=\"BLOOM\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/THUDM/ChatGLM-6B\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/LLMs-chatGLM-6082B6.svg?style=flat-square\" alt=\"chatGLM\"\u003e\n  \u003c/a\u003e\n    \u003ca href=\"https://github.com/baichuan-inc/Baichuan-7B\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/LLMs-Baichuan-18ac62.svg?style=flat-square\" alt=\"Baichuan\"\u003e\n  
\u003c/a\u003e\n\u003c/div\u003e\n\u003ch3\u003eOn the Structural Pruning of Large Language Models\u003c/h3\u003e\n:llama: :llama: :llama: :llama: :llama: Compress your LLMs to any size! :llama: :llama: :llama: :llama: :llama:\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"100%\" alt=\"image\" src=\"figures/intro.png\"\u003e    \n\u003cimg src=\"figures/LLaMA_example.png\" width=\"100%\"\u003e \u003cbr\u003e\n\u003c/p\u003e\n\n\n## Introduction\n  \n\u003e **[LLM-Pruner: On the Structural Pruning of Large Language Models](https://arxiv.org/abs/2305.11627)** [[arXiv]](https://arxiv.org/abs/2305.11627)   \n\u003e *Xinyin Ma, Gongfan Fang, Xinchao Wang*   \n\u003e *National University of Singapore*  \n\n#### Why LLM-Pruner\n- [x] **Task-agnostic compression**: The compressed LLM should retain its original ability as a multi-task solver. \n- [x] **Less training corpus**: In this work, we use only 50k publicly available samples (alpaca) to post-train the LLM.  \n- [x] **Efficient compression**: 3 minutes for pruning and 3 hours for post-training. 
(You can make it longer)\n- [x] **Automatic structural pruning**: Pruning new LLMs with minimal human effort (In progress).\n\n#### Supported LLMs:\n- [x] [Llama-3.1](https://huggingface.co/collections/meta-llama/llama-31-669fc079a0c406a149a5738f)\n- [x] [Llama-3](https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6)\n- [x] [Llama-2](https://github.com/horseee/LLM-Pruner#1-pruning-discovery-stage--estimation-stage)\n- [x] [LLaMA](https://github.com/horseee/LLM-Pruner#1-pruning-discovery-stage--estimation-stage)\n- [x] [BLOOM](https://github.com/horseee/LLM-Pruner/tree/main/examples#cherry_blossom-bloom) \n- [x] [Vicuna](https://github.com/horseee/LLM-Pruner#llama-vicuna-pruning)\n- [x] [Baichuan](https://github.com/horseee/LLM-Pruner/tree/main/examples#llama-baichuan-pruning)\n- [x] [TinyLlama](https://github.com/jzhang38/TinyLlama) \n\n#### Updates:\n* July 27, 2024: :rocket: Support GQA! Now LLM-Pruner can work on Llama3 and Llama 3.1. We are still testing the pruning results of new LLMs (Llama3, Llama3.1, Gemma) and you can find the pruning results [here](https://github.com/horseee/LLM-Pruner/tree/main/more_results#more-results).\n* August 30, 2023: LLM-Pruner now supports [BLOOM](https://huggingface.co/docs/transformers/model_doc/bloom) :cherry_blossom:\n* August 14, 2023:  [Code](https://github.com/horseee/LLM-Pruner#2-post-training-recover-stage) and [results](https://github.com/horseee/LLM-Pruner#2-post-training-recover-stage) for finetuning with a large-scale corpus are now available. The fine-tuned LLaMA-5.4B model achieves an average accuracy of 62.36%, closely approaching the original LLaMA-7B (63.25%).\n* July 19, 2023: :fire:  LLM-Pruner now supports Llama-2-7b and Llama-2-13b (the huggingface version) \n* July 18, 2023: :rocket: Support [Baichuan](https://github.com/baichuan-inc/Baichuan-7B), a bilingual LLM.\n* May 20, 2023: :tada: Code and Preprint Paper released! 
\n\n#### TODO List:\n- [ ] A tutorial for pruning new LLMs.\n- [ ] Support `.from_pretrained()` for loading the model.\n\n#### **Contact Us:**\nJoin our WeChat group for a chat:\n  * WeChat Group [Group-2](https://github.com/user-attachments/assets/3fe4c487-5a5b-43fd-bf64-a5ee62c3dec1)  (\u003e200/500), [Group-1](https://github.com/VainF/Torch-Pruning/assets/18592211/35d66130-eb03-4dcb-ad75-8df784460ad3) (500/500, FULL).\n\n\n\n## Table of Contents\n  - [Quick Start](#quick-start)\n  - [Step-by-step Instructions](#step-by-step-instructions)\n  - [Zero-shot Evaluation](#zero-shot-evaluation)\n  - [More-Examples](#more-examples)\n  - [Version Information](#version-information)\n  - [Limitations](#limitations)\n  - [Acknowledgement](#acknowledgement)\n  - [Citation](#citation)\n\n## Quick Start\n\n### Installation\n```\npip install -r requirement.txt\n```\n\n### Minimal Example\n```\nbash script/llama_prune.sh\n```\nThis script would compress the LLaMA-7B model with ～20\\% parameters pruned. All the pre-trained models and the dataset would be automatically downloaded, so you do not need to manually download the resource. When running this script for the first time, it will require some time to download the model and the dataset.\n\n    \n## Step-by-step Instructions  \n    \nIt takes three steps to prune an LLM:\n* \u003cu\u003eDiscovery Stage\u003c/u\u003e: Discover the complicated inter-dependency in LLMs and find the minimally-removable unit, **group**.\n* \u003cu\u003eEstimation Stage\u003c/u\u003e: Estimate the contribution of each group to the overall performance of the model and decide which group to prune. \n* \u003cu\u003eRecover Stage\u003c/u\u003e: Fast post-training to recover model performance.\n  \nAfter pruning and post-training, we follow \u003ca href=\"https://github.com/EleutherAI/lm-evaluation-harness\"\u003elm-evaluation-harness\u003c/a\u003e for evaluation.\n    \n### 1. 
Pruning (Discovery Stage + Estimation Stage)\n    \n:llama: **LLaMA/Llama-2 pruning with ~20% parameters pruned:**\n```\npython hf_prune.py --pruning_ratio 0.25 \\\n      --block_wise \\\n      --block_mlp_layer_start 4 --block_mlp_layer_end 30 \\\n      --block_attention_layer_start 4 --block_attention_layer_end 30 \\\n      --pruner_type taylor \\\n      --test_after_train \\\n      --device cpu  --eval_device cuda \\\n      --save_ckpt_log_name llama_prune \n```\nArguments:\n- ``Base model``: Choose the base model from LLaMA or Llama-2 and pass the `pretrained_model_name_or_path` to `--base_model`. The model name is used for `AutoModel.from_pretrained` to load the pre-trained LLM. For example, if you want to use the llama-2 with 13 billion parameters, then pass `meta-llama/Llama-2-13b-hf` to `--base_model`.\n- ``Pruning Strategy``: Choose between block-wise, channel-wise, or layer-wise pruning using the respective command options: {--block_wise}, {--channel_wise}, {--layer_wise --layer NUMBER_OF_LAYERS}. For block-wise pruning, specify the start and end layers to be pruned. Channel-wise pruning does not require extra arguments. For layer pruning, use --layer NUMBER_OF_LAYERS to specify the desired number of layers to be kept after pruning.\n- ``Importance Criterion``: Select from l1, l2, random, or taylor using the --pruner_type argument. For the taylor pruner, choose one of the following options: vectorize, param_second, param_first, param_mix. By default, param_mix is used, which combines approximated second-order hessian and first-order gradient. If using l1, l2, or random, no extra arguments are required.\n- ``Pruning Ratio``: Specifies the pruning ratio of groups. It differs from the pruning rate of parameters, as groups are removed as the minimal units.\n- ``Device`` and ``Eval_device``: Pruning and evaluation can be performed on different devices. Taylor-based methods require backward computation during pruning, which may require significant GPU RAM. 
Our implementation uses the CPU for importance estimation (also supports GPU, simply use --device cuda). eval_device is used to test the pruned model.\n \n\n\n#### :llama: Vicuna Pruning\n\n\u003cdetails\u003e\n\u003csummary\u003eDetails:\u003c/summary\u003e\n  \nIf you want to try Vicuna, please specify the argument `--base_model` to the path to vicuna weight. Please follow \u003ca href=\"https://github.com/lm-sys/FastChat\"\u003ehttps://github.com/lm-sys/FastChat\u003c/a\u003e to get Vicuna weights.\n```\npython hf_prune.py --pruning_ratio 0.25 \\\n      --block_wise \\\n      --block_mlp_layer_start 4 --block_mlp_layer_end 30 \\\n      --block_attention_layer_start 4 --block_attention_layer_end 30 \\\n      --pruner_type taylor \\\n      --test_after_train \\\n      --device cpu  --eval_device cuda \\\n      --save_ckpt_log_name llama_prune \\\n      --base_model PATH_TO_VICUNA_WEIGHTS\n```\n\n\u003c/details\u003e\n\n\n#### :llama: Baichuan Pruning\n\n\u003cdetails\u003e\n\u003csummary\u003eDetails:\u003c/summary\u003e\n  \nPlease refer to the [Example/Baichuan](https://github.com/horseee/LLM-Pruner/tree/main/examples#llama-baichuan-pruning) for more details\n\n\u003c/details\u003e\n\n#### :llama: Llama3/Llama3.1 Pruning\n\n\u003cdetails\u003e\n\u003csummary\u003eDetails:\u003c/summary\u003e\n  \n```\npython llama3.py --pruning_ratio 0.25 \\\n                 --device cuda --eval_device cuda \\\n                 --base_model meta-llama/Meta-Llama-3-8B-Instruct \\\n                 --block_wise --block_mlp_layer_start 4 --block_mlp_layer_end 30 \\\n                 --block_attention_layer_start 4 --block_attention_layer_end 30 \\\n                 --save_ckpt_log_name llama3_prune \\\n                 --pruner_type taylor --taylor param_first \\\n                 --max_seq_len 2048 \\\n                 --test_after_train --test_before_train --save_model \n```\n\n\u003c/details\u003e\n    \n### 2. 
Post-Training (Recover Stage)\n\n* Train using Alpaca with 50,000 samples. Here's an example of training on a single GPU:\n```\nCUDA_VISIBLE_DEVICES=X python post_training.py --prune_model prune_log/PATH_TO_PRUNE_MODEL/pytorch_model.bin \\\n      --data_path yahma/alpaca-cleaned \\\n      --lora_r 8 \\\n      --num_epochs 2 \\\n      --learning_rate 1e-4 \\\n      --batch_size 64 \\\n      --output_dir tune_log/PATH_TO_SAVE_TUNE_MODEL \\\n      --wandb_project llama_tune\n```\nMake sure to replace `PATH_TO_PRUNE_MODEL` with the path to the pruned model in step 1, and replace `PATH_TO_SAVE_TUNE_MODEL` with the desired location where you want to save the tuned model.\n\n**Tip**: [Training LLaMA-2 in float16 is not recommended and is known to produce nan; as such, the model should be trained in bfloat16.](https://huggingface.co/docs/transformers/model_doc/llama2#usage-tips)\n\n* Train using [MBZUAI/LaMini-instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction) with 2.59M samples. Here is an example using multiple GPUs for training:\n```\ndeepspeed --include=localhost:1,2,3,4 post_training.py \\\n      --prune_model prune_log/PATH_TO_PRUNE_MODEL/pytorch_model.bin \\\n      --data_path MBZUAI/LaMini-instruction  \\\n      --lora_r 8 \\\n      --num_epochs 3  \\\n      --output_dir tune_log/PATH_TO_SAVE_TUNE_MODEL \\\n      --extra_val_dataset wikitext2,ptb \\\n      --wandb_project llmpruner_lamini_tune \\\n      --learning_rate 5e-5 \\\n      --cache_dataset\n```\n\n### 3. Generation\n\n#### How to load pruned/pre-trained models:\n\nFor the pruned model, simply use the following command to load your model. 
\n``` \nimport torch\n\n# The checkpoint is a pickled dict holding the tokenizer and the pruned model.\npruned_dict = torch.load(YOUR_CHECKPOINT_PATH, map_location='cpu')\ntokenizer, model = pruned_dict['tokenizer'], pruned_dict['model']\n```\nBecause pruning leaves each layer with a different configuration (some layers stay wider while others are pruned more heavily), the model cannot be loaded with the `.from_pretrained()` provided by Hugging Face. Currently, we simply use `torch.save` to save the pruned model and `torch.load` to load it.\n  \n#### Generation with Gradio Interface\nWe provide a simple script to generate text using pre-trained models, pruned models, or pruned models with post-training. \n    \n* LLaMA-7B Pre-trained\n```\npython generate.py --model_type pretrain\n```\n* Pruned Model without Post-Training\n```\npython generate.py --model_type pruneLLM --ckpt \u003cYOUR_MODEL_PATH_FOR_PRUNE_MODEL\u003e\n```\n* Pruned Model with Post-Training \n```\npython generate.py --model_type tune_prune_LLM --ckpt \u003cYOUR_CKPT_PATH_FOR_PRUNE_MODEL\u003e --lora_ckpt \u003cYOUR_CKPT_PATH_FOR_LORA_WEIGHT\u003e\n```\n\nThe above instructions will deploy your LLMs locally. \n  \n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"figures/deploy.png\" width=\"100%\"\u003e\u003c/img\u003e\n\u003c/div\u003e\n\n\n### 4. Evaluation\nTo evaluate the performance of the pruned model, we follow [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness):\n* Step 1: If you only need to evaluate the pruned model, then skip this step and jump to Step 2.\nThis step arranges the files to satisfy the input requirements of `lm-evaluation-harness`. 
The [tuned checkpoint from the post-training step](https://github.com/horseee/LLM-Pruner#2-post-training-recover-stage) is saved in the following format:\n```\n- PATH_TO_SAVE_TUNE_MODEL\n  | - checkpoint-200\n      | - pytorch_model.bin\n      | - optimizer.pt\n      ...\n  | - checkpoint-400\n  | - checkpoint-600\n  ...\n  | - adapter_config.bin\n  | - adapter_config.json\n```\nArrange the files with the following commands:\n```\ncd PATH_TO_SAVE_TUNE_MODEL\nexport epoch=YOUR_EVALUATE_EPOCH\ncp adapter_config.json checkpoint-$epoch/\nmv checkpoint-$epoch/pytorch_model.bin checkpoint-$epoch/adapter_model.bin\n```\nFor example, to evaluate `checkpoint-200`, set `epoch` to 200 with `export epoch=200`.\n\n\n* Step 2:\n```\nexport PYTHONPATH='.'\npython lm-evaluation-harness/main.py --model hf-causal-experimental \\\n       --model_args checkpoint=PATH_TO_PRUNE_MODEL,peft=PATH_TO_SAVE_TUNE_MODEL,config_pretrained=PATH_OR_NAME_TO_BASE_MODEL \\\n       --tasks openbookqa,arc_easy,winogrande,hellaswag,arc_challenge,piqa,boolq \\\n       --device cuda:0 --no_cache \\\n       --output_path PATH_TO_SAVE_EVALUATION_LOG \n```\nHere, replace `PATH_TO_PRUNE_MODEL` and `PATH_TO_SAVE_TUNE_MODEL` with the paths where you saved the pruned model and the tuned model; `PATH_OR_NAME_TO_BASE_MODEL` is used to load the configuration file of the base model. \n\n[Update]: We uploaded a script to simplify the evaluation process if you want to evaluate the pruned model with the tuned checkpoint. Simply use the following command:\n```\nCUDA_VISIBLE_DEVICES=X bash scripts/evaluate.sh PATH_OR_NAME_TO_BASE_MODEL PATH_TO_SAVE_TUNE_MODEL  PATH_TO_PRUNE_MODEL EPOCHS_YOU_WANT_TO_EVALUATE\n```\nReplace the placeholders in the command with your model's information. The final argument lists the epochs to iterate over if you want to evaluate several checkpoints in one command. 
For example:\n```\nCUDA_VISIBLE_DEVICES=1 bash scripts/evaluate.sh decapoda-research/llama-7b-hf tune_log/llama_7B_hessian prune_log/llama_prune_7B 200 1000 2000\n```\n\n\n### 5. Testing MACs, Params and Memory\n\n* Pre-trained\n```\npython test_speedup.py --model_type pretrain\n```\n* Pruned Model\n```\npython test_speedup.py --model_type pruneLLM --ckpt \u003cYOUR_MODEL_PATH_FOR_PRUNE_MODEL\u003e\n```\n\n## Zero-shot Evaluation\n\nBrief quantitative results for LLaMA-7B:\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/LLaMAResults.png\" width=\"100%\"\u003e \u003cbr\u003e\n\u003c/p\u003e\n    \nThe results for Vicuna-7B:\n    \n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/VicunaResults.png\" width=\"100%\"\u003e \u003cbr\u003e\n\u003c/p\u003e\n    \nThe results for ChatGLM-6B:\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/ChatGLMResults.png\" width=\"80%\"\u003e \u003cbr\u003e\n\u003c/p\u003e\n\nStatistics for pruned models:\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"figures/statistic.png\" width=\"50%\"\u003e \u003cbr\u003e\n\u003c/p\u003e\n\nResults of LLM-Pruner with 2.59M samples:\n| Model         | #Param | Memory     | Latency | Speedup | BoolQ | PIQA  | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA  | Average |\n|---------------|--------|------------|---------|---------|-------|-------|-----------|------------|-------|-------|-------|---------|\n| LLaMA-7B      | 6.74B  | 12884.5MiB | 69.32s  | 1x      | 73.18 | 78.35 | 72.99     | 67.01      | 67.45 | 41.38 | 42.40 | 63.25   |\n| LLaMA-5.4B with Alpaca(50k)    | 5.47B  | 10488.4MiB | 58.55s  | 1.18x   | 64.62 | 77.20 | 68.80     | 63.14      | 64.31 | 36.77 | 39.80 | 59.23   |\n| LLaMA-5.4B with LaMini(2.59M)  | 5.47B  | 10488.4MiB | 58.55s  | 1.18x   | 76.57 | 77.37 | 66.60     | 65.82      | 70.62 | 40.70 | 38.80 | 62.36   |\n\n\nMore results can be found in the paper. 
\n\n## More Examples\n\n| Original Model | Output | Pruned Model |  Output |\n|--|--|--|--| \n| LLaMa-7B | \u003cu\u003eThe universe is the entirety of space, time, matter, and energy that exists.\u003c/u\u003e It includes all matter and energy, whether visible or not, and in fact the whole universe is invisible. There are many theories regarding its origin and ultimate fate... | LLaMa-5.4B | \u003cu\u003eThe universe is the entirety of space, time, matter, and energy that exists.\u003c/u\u003e The laws that make up physics — electrons orbiting atoms, magnets attracting other objects, planets orbiting the Sun—have been consistent over ...|\n| Vicuna-7B | \u003cu\u003e10 steps to build an ios app:\u003c/u\u003e 1. Research and develop the idea. 2. Decide on the platform you want to use (iOS, Android, or both). 3. Create a plan and budget. 4. Hire a team of developers. 5. Design the user interface (UI). 6. Develop the user interface (UI). 7. Test and iterate on the app. 8. Submit the app to the app store. 9. Optimize the app for performance and user experience (UX). 10. Market and promote the app. | Vicuna-5.4B | \u003cu\u003e10 steps to build an ios app:\u003c/u\u003e 1. Plan your app: decide what it should do, what it will look like, and what it will cost to build it. 2. Hire a developer: A good iOS developer will help you design the user interface, create the app, and make sure it’s bug-free. 3. Create a user interface: This is where the magic happens. Your app’s user interface will be designed and mocked up at this point... |\n| ChatGLM-6B | \u003cu\u003eWrite a poetry about love and peace.\u003c/u\u003e In the midst of life's hustle and bustle, I find your love a soothing peace. It is a feeling that cannot be conveyed, A connection that cannot be distanced. With your smile, your eyes, your embrace, I feel a sense of belonging that only you can give. It is a love that lasts forever, A love that cannot be defeat... 
| ChatGLM-5.6B | \u003cu\u003eWrite a poetry about love and peace.\u003c/u\u003e Love and peace, two such wonderful things\\ A pure and endless desire in my heart\\ And both of them, I must seek for\\ A long, long time, I know..\\ Love, I know, is a feeling of being\\ A perfect partner, in every sense\\ And peace, I need it, so much, one day\\ A long, long way, my heart will go..|\n\n## Version Information\nDue to changes in the versions of models and repos used in this project, we listed some known version issues and the specific versions needed to reproduce our method:\n1. lm-eval-harness: We use [this commit](https://github.com/EleutherAI/lm-evaluation-harness/tree/4d21ab6b2713cc3a8b4fa7574e89c62ef504e75f) of lm-evaluation-harness, and the code is also included in this repo. Please check [Issue #25](https://github.com/horseee/LLM-Pruner/issues/25) for details.\n2. LLaMA1-7B: We use the checkpoint of [decapoda-research/llama-7b-hf](https://huggingface.co/decapoda-research/llama-7b-hf) in our experiments, which is not available now. Please consider using the copied version, e.g.,[baffo32/decapoda-research-llama-7B-hf](https://huggingface.co/baffo32/decapoda-research-llama-7B-hf).\n\n\n## Limitations\n* Although we only used 50K data and trained for three hours, more data would definitely be better. We are testing on this.\n* The current compressed model still has several issues, such as generating repetitive tokens or producing nonsensical sentences. We believe there is significant room for improvement in the quality of the compressed model.\n* There are still some models for which we cannot automatically identify the mapping of indexes after concatenation and view operations. Therefore, we need to perform additional manual operations. 
\n\n\n## Acknowledgement\n* Logo is generated by \u003ca href=\"https://dreamstudio.ai/generate\"\u003eStable Diffusion\u003c/a\u003e\n* The evaluation of the LLM:  \u003ca href=\"https://github.com/EleutherAI/lm-evaluation-harness\"\u003elm-evaluation-harness\u003c/a\u003e\n* LLaMA: \u003ca href=\"https://github.com/facebookresearch/llama\"\u003e https://github.com/facebookresearch/llama\u003c/a\u003e\n* Vicuna: \u003ca href=\"https://github.com/lm-sys/FastChat\"\u003ehttps://github.com/lm-sys/FastChat\u003c/a\u003e\n* Peft: \u003ca href=\"https://github.com/huggingface/peft\"\u003ehttps://github.com/huggingface/peft\u003c/a\u003e\n* Alpaca-lora: \u003ca href=\"https://github.com/tloen/alpaca-lora\"\u003ehttps://github.com/tloen/alpaca-lora\u003c/a\u003e\n\n## Citation\nIf you find this project useful, please cite\n```\n@inproceedings{ma2023llmpruner,\n  title={LLM-Pruner: On the Structural Pruning of Large Language Models},\n  author={Xinyin Ma and Gongfan Fang and Xinchao Wang},\n  booktitle={Advances in Neural Information Processing Systems},\n  year={2023},\n}\n```\n```\n@article{fang2023depgraph,\n  title={DepGraph: Towards Any Structural Pruning},\n  author={Fang, Gongfan and Ma, Xinyin and Song, Mingli and Mi, Michael Bi and Wang, Xinchao},\n  journal={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  year={2023}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhorseee%2FLLM-Pruner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhorseee%2FLLM-Pruner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhorseee%2FLLM-Pruner/lists"}