{"id":21773518,"url":"https://github.com/thu-nics/MoA","last_synced_at":"2025-07-19T10:31:00.514Z","repository":{"id":245111463,"uuid":"817160481","full_name":"thu-nics/MoA","owner":"thu-nics","description":"The official implementation of the paper \u003cMoA: Mixture of Sparse Attention for Automatic Large Language Model Compression\u003e","archived":false,"fork":false,"pushed_at":"2024-11-12T03:20:54.000Z","size":517,"stargazers_count":97,"open_issues_count":0,"forks_count":6,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-12T04:23:05.733Z","etag":null,"topics":["large-language-models","model-compression","sparse-attention"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thu-nics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-19T06:31:47.000Z","updated_at":"2024-11-12T03:20:59.000Z","dependencies_parsed_at":"2024-08-12T10:01:15.583Z","dependency_job_id":null,"html_url":"https://github.com/thu-nics/MoA","commit_stats":null,"previous_names":["thu-nics/moa"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-nics%2FMoA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-nics%2FMoA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-nics%2FMoA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-nics%2FMoA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/owners/thu-nics","download_url":"https://codeload.github.com/thu-nics/MoA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226584451,"owners_count":17655036,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-language-models","model-compression","sparse-attention"],"created_at":"2024-11-26T17:01:31.946Z","updated_at":"2025-07-19T10:31:00.497Z","avatar_url":"https://github.com/thu-nics.png","language":"Python","funding_links":[],"categories":["Python","A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression\n\u003cp align=\"center\"\u003e\n🌐 \u003ca href=\"https://thu-nics.github.io/MoA_project_page/\"\u003e\u003cb\u003eProject Page\u003c/b\u003e\u003c/a\u003e\u0026nbsp\u0026nbsp | \u0026nbsp\u0026nbsp📑 \u003ca href=\"https://arxiv.org/abs/2406.14909\"\u003e\u003cb\u003earXiv\u003c/b\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003ctable width=\"100%\"\u003e\n\u003ctr\u003e\n  \u003c!-- Column for the image and text --\u003e\n  \u003ctd width=\"60%\" valign=\"top\"\u003e\n    \u003cimg src=\"https://github.com/thu-nics/MoA_project_page/blob/master/static/images/workflow.png?raw=true\" alt=\"Workflow Intuition\" style=\"width:100%;\"\u003e\n    \u003cp\u003eCompressing the attention operation is crucial for the efficiency of processing long inputs. 
Existing sparse attention methods (more specifically, local attention methods), such as StreamingLLM, adopt uniform and fixed attention masks across different attention heads. Nevertheless, some heads need to attend to more distant information than others; and as the input sequence gets longer, some heads might need to increase their span more than others. In this work, we propose MoA that overcomes the drawbacks of uniform sparse attention by searching heterogeneous elastic rules for each attention head using an automatic pipeline.\u003c/p\u003e\n  \u003c/td\u003e\n\n  \u003c!-- Column for the GIF --\u003e\n  \u003ctd width=\"40%\" valign=\"top\"\u003e\n    \u003cimg src=\"https://github.com/thu-nics/MoA_project_page/raw/master/static/images/moa_demo.gif\" alt=\"MoA Demo\" style=\"width:100%;\"\u003e\n  \u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\nMoA achieves a 1.2-1.4x GPU memory reduction, boosting decode throughput by 6.6−8.2x and 1.7−1.9x compared to FlashAttention2 and vLLM, with minimal impact on performance.\n\nIf you find this repository or paper useful, you can cite\n```\n@article{fu2024moa,\n  title={Moa: Mixture of sparse attention for automatic large language model compression},\n  author={Fu, Tianyu and Huang, Haofeng and Ning, Xuefei and Zhang, Genghan and Chen, Boju and Wu, Tianqi and Wang, Hongyi and Huang, Zixiao and Li, Shiyao and Yan, Shengen and others},\n  journal={arXiv preprint arXiv:2406.14909},\n  year={2024}\n}\n```\n\n## News\n\n- [2024/10] MoA kernel is now available in [CUDA](https://github.com/thu-nics/MoA_Kernel), achieving faster inference speed.\n\n## Environment Setup\n\nFirst, create the Conda environment and install the relevant packages using the following commands:\n\n```bash\nconda create -n moa python=3.10\nconda activate moa\n\npip install -r requirements.txt\npip install -e .\n```\n\nThen, install the MoA kernel by following the instructions in the [MoA Kernel 
repository](https://github.com/thu-nics/MoA_Kernel).\n\n## Kind Notes\n\n### Cloning the Repository\n\nIf you have trouble cloning the repo, it is probably because the repo's git-lfs content is too large. You can safely skip downloading the git-lfs files with `git clone --no-checkout \u003crepo_url\u003e`.\n\n### Group Query Attention Models\n\nIf you are testing the accuracy of group query attention (GQA) models with our kernel, please convert them to multi-head attention models before profiling and inference. You can do so by running the `scripts/helper/gqa_to_mha.py` script.\n\n## Quick Start: Use Pre-defined Plans\n\nIf you prefer not to perform the automatic compression plan search steps and want immediate results, we provide pre-compressed configurations for the `lmsys/vicuna-{size}-v1.5-16k` models (7B and 13B versions). These can be found in the `.json` files under the `examples` directory.\n\nYou can go directly to the `Evaluation` section to evaluate the model with the plans.\nIf you want to compress other models, follow the `Automatic Search Pipeline` section to compress the model yourself.\n\n## Automatic Search Pipeline\n\nThe pipeline automatically compresses the LLM by finding the optimal MoA configurations for each attention head and layer. The pipeline consists of four steps: calibration dataset generation, profile, optimize, and validate.\n\nTo run the entire pipeline with one line of code, use `scripts/pipeline/main.py`. For GQA models, add the `--is_gqa` parameter. For the vicuna example:\n\n```bash\npython scripts/pipeline/main.py --model_path lmsys/vicuna-7b-v1.5-16k --model_name lmsys--vicuna-7b-v1.5-16k\n```\n\nAfter the pipeline completes, you can evaluate the model with the generated plans as described in the `Evaluation` section. If you want to understand the pipeline in detail, you can follow the steps below instead.\n\n### Calibration Dataset Generation\n\nMoA creates the calibration dataset with long dependency and model alignment. 
We publish the calibration dataset at [this HuggingFace Repository](https://huggingface.co/datasets/nics-efc/MoA_Long_HumanQA) with human-written answers. To ensure \"model alignment\", we should generate the model answers from the original dense LLM.\nThis involves querying an LLM with original questions to collect its responses, which are then formatted into a standard Hugging Face `Dataset` item.\n\n```bash\npython scripts/pipeline/generate_calibration_dataset.py --model_path lmsys/vicuna-7b-v1.5-16k --model_name vicuna-7b-v1.5-16k --output_path_base output/lmsys--vicuna-7b-v1.5-16k/dataset\n```\n\n### Profile\nMoA employs a gradient based method to quantify the importance of the attention values. The `--response_mask` option specifies that only the model's responses are used as supervision. Given the calibration dataset, the profile process outputs the average attention influence tensor at a specific sequence length.\n\n```bash\npython scripts/pipeline/pipeline_profile.py --model_name lmsys/vicuna-7b-v1.5-16k --max_length 2048 --response_mask --dataset_dir output/lmsys--vicuna-7b-v1.5-16k/dataset/multi_conversation_model/multi_news --grad_dir output/lmsys--vicuna-7b-v1.5-16k/profile/profile_2k\n\npython scripts/pipeline/pipeline_profile.py --model_name lmsys/vicuna-7b-v1.5-16k --max_length 4096 --response_mask --dataset_dir output/lmsys--vicuna-7b-v1.5-16k/dataset/multi_conversation_model/multi_news --grad_dir output/lmsys--vicuna-7b-v1.5-16k/profile/profile_4k\n\npython scripts/pipeline/pipeline_profile.py --model_name lmsys/vicuna-7b-v1.5-16k --max_length 8192 --response_mask --dataset_dir output/lmsys--vicuna-7b-v1.5-16k/dataset/multi_conversation_model/multi_news --grad_dir output/lmsys--vicuna-7b-v1.5-16k/profile/profile_8k\n```\n\n### Optimize\n\nMoA identifies Pareto front compression plans to  minimize accuracy losses across various sequence lengths under density budget. 
The `--elastic_length` option specifies the sequence lengths at which profiling was done, `--extend_length` determines the maximum length to which the compression plan should extend, and `--density_bounds` sets the maximum allowable attention density for each length.\n\n```bash\npython scripts/pipeline/elastic_generate.py --output_dir output/lmsys--vicuna-7b-v1.5-16k/optimize --elastic_length 2048 4096 8192 --extend_length 16384 --density_bounds 0.5 0.5 0.5 0.5 --importance_tensor_dir output/lmsys--vicuna-7b-v1.5-16k/profile/ --output_length 4096 8192 12288 16384\n```\n\nYou can set `--time_limit num` to specify the maximum duration (in seconds) for each single-objective optimization. You may also need to apply for a Gurobi license on the [official website](https://www.gurobi.com/) to use the optimization library.\n\n### Validate\n\nMoA selects the plan that yields the minimum loss at an unseen length among the Pareto front plans.\n\nTo evaluate the loss of a certain plan on a specified length level, use the following command, replacing `{i}` with the actual plan ID:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python scripts/pipeline/perplexity_evaluate.py --model_name lmsys/vicuna-7b-v1.5-16k --max_length 12288 --dataset_dir nics-efc/MoA_Long_HumanQA --split valid --response_mask --moa_config output/lmsys--vicuna-7b-v1.5-16k/optimize/moa_config_plan_{i}.json --result_path output/lmsys--vicuna-7b-v1.5-16k/validate/validate_0.csv\n```\n\nAlternatively, to evaluate all plans within a directory, run the following script:\n\n```bash\nscripts/pipeline/validate.sh \u003cmoa_config_dir\u003e \u003cmoa_config_num\u003e \u003cresult_dir\u003e \u003cmodel_name\u003e\n```\n\nFor example:\n\n```bash\nscripts/pipeline/validate.sh output/lmsys--vicuna-7b-v1.5-16k/optimize/ \u003cplan_num\u003e output/lmsys--vicuna-7b-v1.5-16k/validate lmsys/vicuna-7b-v1.5-16k\n```\n\nReplace `\u003cplan_num\u003e` with the number of plans under the directory.\n\n## Evaluation\n\nWe provide the example 
compression plans under the `examples` directory. You can use them by setting the `--moa_config` option in the commands below to the `.json` files under that directory.\n\n### Apply MoA to LLM\n\nGiven the compression plan found by MoA, you can apply the plan to the model with a few lines of code.\n\n```python\nimport json\n\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\nfrom MoA.models.interface import update_model_function\n\n# Load the huggingface model\nmodel_name = \"lmsys/vicuna-7b-v1.5-16k\"\nmodel = AutoModelForCausalLM.from_pretrained(model_name)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# Load the MoA compression plan from its JSON file\nmoa_config_path = \"examples/lmsys--vicuna-7b-v1.5-16k/moa_alpha_beta.json\"\nwith open(moa_config_path, 'r') as f:\n    moa_config = json.load(f)\n# Add mixture of sparse attention capability to the model\nmodel = update_model_function(model, model_name)\nmodel.model.set_mixture_of_attention(moa_config, permute_head=True)\n\n# Now you can use the `model` for efficient inference like any regular huggingface model\n# For example, you can use it in pipeline to chat with the model\npipe = pipeline(task=\"text-generation\", tokenizer=tokenizer, model=model, trust_remote_code=True)\nprompt = \"Hi.\"\noutput = pipe(prompt)\n```\n\n### Retrieval\n\nMoA aims to preserve the retrieval ability of the original dense model with a reduced impact on accuracy. To evaluate the retrieval performance of a specific plan at a given input length, use the following command, replacing `{i}` with the actual plan ID:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python scripts/evaluate/retrieval_evaluate.py --model_name lmsys/vicuna-7b-v1.5-16k --moa_config output/lmsys--vicuna-7b-v1.5-16k/optimize/moa_config_plan_{i}.json --output_dir output/lmsys--vicuna-7b-v1.5-16k/evaluate/retrieval --length_level 8\n```\n\n\u003e Alternatively, you can use our example plans. 
When passing in multiple plans at different lengths, the correct length will be automatically selected according to the input length:\n\u003e \n\u003e ```bash\n\u003e CUDA_VISIBLE_DEVICES=0 python scripts/evaluate/retrieval_evaluate.py --model_name lmsys/vicuna-7b-v1.5-16k --moa_config examples/lmsys--vicuna-7b-v1.5-16k/moa_alpha_beta.json --output_dir output/lmsys--vicuna-7b-v1.5-16k/evaluate/retrieval --length_level 8\n\u003e ```\n\n### LongBench\n\nMoA strives to maintain the long-context understanding ability of the original dense model. To assess this capability using the [LongBench benchmark](https://github.com/THUDM/LongBench), execute the following command, substituting `{i}` with the actual plan ID:\n\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python scripts/evaluate/longbench_evaluate.py --model_name lmsys/vicuna-7b-v1.5-16k --max_length 3500 --eval longbench_fast --longbench_e --longbench_result_dir output/lmsys--vicuna-7b-v1.5-16k/evaluate/longbench --longbench_length_range 0-4k --moa_config output/lmsys--vicuna-7b-v1.5-16k/optimize/moa_config_plan_{i}.json\n\nCUDA_VISIBLE_DEVICES=0 python scripts/evaluate/longbench_evaluate.py --model_name lmsys/vicuna-7b-v1.5-16k --max_length 7500 --eval longbench_fast --longbench_e --longbench_result_dir output/lmsys--vicuna-7b-v1.5-16k/evaluate/longbench --longbench_length_range 4-8k --moa_config output/lmsys--vicuna-7b-v1.5-16k/optimize/moa_config_plan_{i}.json\n\nCUDA_VISIBLE_DEVICES=0 python scripts/evaluate/longbench_evaluate.py --model_name lmsys/vicuna-7b-v1.5-16k --max_length 15500 --eval longbench_fast --longbench_e --longbench_result_dir output/lmsys--vicuna-7b-v1.5-16k/evaluate/longbench --longbench_length_range 8k+ --moa_config output/lmsys--vicuna-7b-v1.5-16k/optimize/moa_config_plan_{i}.json\n```\n\n\u003e Alternatively, you can use our example plans.\n\n### Chat Demo\n\nTo chat with the model using the example plans, run the following command:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python 
scripts/evaluate/chat_demo.py --model_name lmsys/vicuna-7b-v1.5-16k --moa_config examples/lmsys--vicuna-7b-v1.5-16k/moa_alpha_beta.json --batch_size 16\n```\n\n\u003e Currently, the input prompt should have at least 64 tokens.\n\n## TODOs\n\n- [ ] Support padding in batch inference\n\n- [ ] Support prefill with past_key_values (use Key-Value cache in multi-round conversation)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-nics%2FMoA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthu-nics%2FMoA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-nics%2FMoA/lists"}