{"id":15012276,"url":"https://github.com/microsoft/minference","last_synced_at":"2025-05-14T12:11:49.486Z","repository":{"id":246786900,"uuid":"804362023","full_name":"microsoft/MInference","owner":"microsoft","description":"[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention, which reduces inference latency by up to 10x for pre-filling on an A100 while maintaining accuracy.","archived":false,"fork":false,"pushed_at":"2025-05-05T06:38:02.000Z","size":9582,"stargazers_count":1005,"open_issues_count":58,"forks_count":51,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-05-06T13:43:29.289Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://aka.ms/MInference","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/microsoft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-05-22T12:59:47.000Z","updated_at":"2025-05-06T07:30:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"c4634aeb-cfe1-4588-8bc2-1710aa7708ad","html_url":"https://github.com/microsoft/MInference","commit_stats":{"total_commits":125,"total_committers":4,"mean_commits":31.25,"dds":"0.32799999999999996","last_synced_commit":"cb7bdd3ff1613525f3c056fc6e2f4becd2516fa2"},"previous_names":["microsoft/minference"],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FMInference","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FMInference/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FMInference/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/microsoft%2FMInference/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/microsoft","download_url":"https://codeload.github.com/microsoft/MInference/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254140768,"owners_count":22021220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-24T19:42:22.002Z","updated_at":"2025-05-14T12:11:49.473Z","avatar_url":"https://github.com/microsoft.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003cimg alt=\"MInference\" src=\"https://raw.githubusercontent.com/microsoft/MInference/main/images/MInference_logo.png\" width=70%\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n\u003ch2 align=\"center\"\u003eMInference: Million-Tokens Prompt Inference for Long-context 
LLMs</h2>

<p align="center">
    | <a href="https://aka.ms/MInference"><b>Project Page</b></a> |
    <a href="https://arxiv.org/abs/2407.02490"><b>Paper</b></a> |
    <a href="https://huggingface.co/spaces/microsoft/MInference"><b>HF Demo</b></a> |
    <a href="https://aka.ms/SCBench"><b>SCBench</b></a> |
    <a href="https://aka.ms/MMInference"><b>MMInference</b></a> |
</p>

https://github.com/microsoft/MInference/assets/30883354/52613efc-738f-4081-8367-7123c81d6b19

_Now you can process a **1M-token context 10x faster on a single A100** with long-context LLMs such as LLaMA-3-8B-1M and GLM-4-1M, at even **better accuracy**. Try **MInference 1.0** right now!_

## 📰 News
- 🐝 [25/05/02] MMInference has been accepted at **ICML'25**.
- 👾 [25/04/23] We are excited to announce the release of our multi-modality work, [MMInference](https://aka.ms/2504.16083), which uses **modality-aware permutation sparse attention** to accelerate long-context VLMs. We'll present MMInference at the **Microsoft Booth** and **FM-Wild at ICLR'25**. See you in Singapore!
- 👨‍💻‍ [25/04/14] [SGLang](https://github.com/sgl-project/sglang/pull/5327) and [vLLM](https://github.com/vllm-project/flash-attention/pull/33) have merged the MInference sparse attention kernel. Notably, SGLang also adapted it for FlashAttention-3. Special thanks to @zhyncs and @yinfan98 for their contributions!
- 🤗 [25/01/27] MInference has been integrated into [Qwen2.5-1M](https://qwenlm.github.io/blog/qwen2.5-1m/) and its online services. For details, refer to the [paper](https://arxiv.org/abs/2501.15383) and the [vLLM implementation](https://github.com/vllm-project/vllm/pull/11844).
- 🪸 [25/01/23] SCBench has been accepted at **ICLR'25**.
<details>
<summary>More News</summary>
 <ul>
  <li> 🍩 [24/12/13] We are excited to announce the release of our KV cache-centric analysis work, <a href="https://aka.ms/SCBench">SCBench</a>, which evaluates long-context methods from a KV cache perspective.</li>
  <li> 🧤 [24/09/26] MInference has been accepted as a <b>spotlight</b> at <b>NeurIPS'24</b>. See you in Vancouver!</li>
  <li> 👘 [24/09/16] We are pleased to announce the release of our KV cache offloading work, <a href="https://aka.ms/RetrievalAttention">RetrievalAttention</a>, which accelerates long-context LLM inference via vector retrieval.</li>
  <li> 🥤 [24/07/24] MInference now supports <a href="https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct">meta-llama/Meta-Llama-3.1-8B-Instruct</a>.</li>
  <li> 🪗 [24/07/07] Thanks to @AK for sponsoring. You can now use MInference online in the <a href="https://huggingface.co/spaces/microsoft/MInference">HF Demo</a> with ZeroGPU.</li>
  <li> 📃 [24/07/03] Due to an issue with arXiv, the PDF is currently unavailable there.
You can find the paper at this <a href="https://export.arxiv.org/pdf/2407.02490">link</a>.</li>
  <li> 🧩 [24/07/03] We will present <b>MInference 1.0</b> at the <b><i>Microsoft Booth</i></b> and <b><i>ES-FoMo</i></b> at ICML'24. See you in Vienna!</li>
</ul>
</details>

## TL;DR

**MInference 1.0** leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up pre-filling for long-context LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with optimized custom kernels. This approach achieves up to a **10x speedup** for pre-filling on an A100 while maintaining accuracy.

- [MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention](https://arxiv.org/abs/2407.02490) (NeurIPS'24 **spotlight**, ES-FoMo @ ICML'24)<br>
  _Huiqiang Jiang†, Yucheng Li†, Chengruidong Zhang†, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang and Lili Qiu_
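To make the online step concrete, here is a minimal, unoptimized PyTorch sketch of index approximation for the vertical-slash pattern, assuming (as in the paper) that the attention of the last few queries is a good proxy for where the strong vertical columns and slash diagonals lie across the whole query range. The function name, defaults, and single-head shapes are illustrative, not the library's kernel API.

```python
import torch

def estimate_vertical_slash_index(q, k, last_q=64, v_topk=128, s_topk=32):
    """Toy estimate of vertical/slash indices for one head.

    q, k: (seq_len, head_dim). Hypothetical helper, not MInference's API;
    v_topk and s_topk must not exceed seq_len.
    """
    seq_len, head_dim = q.shape
    # Attention of the last `last_q` queries against all keys.
    scores = (q[-last_q:] @ k.T) / head_dim**0.5            # (last_q, seq_len)
    causal = torch.tril(torch.ones(last_q, seq_len, dtype=torch.bool),
                        diagonal=seq_len - last_q)
    probs = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1)

    vertical = probs.sum(dim=0).topk(v_topk).indices        # key columns to keep
    # Aggregate each "slash" (constant query-key distance d); for this
    # (last_q, seq_len) slice, distance d is the diagonal with offset
    # seq_len - last_q - d.
    diag_mass = torch.stack([
        probs.diagonal(offset=seq_len - last_q - d).sum() for d in range(seq_len)
    ])
    slash = diag_mass.topk(s_topk).indices                  # distances to keep
    return vertical, slash

# Example with random projections (illustrative sizes).
q, k = torch.randn(2048, 64), torch.randn(2048, 64)
v_idx, s_idx = estimate_vertical_slash_index(q, k)
```

The selected column indices and diagonal offsets would then be handed to a sparse kernel (such as `vertical_slash_sparse_attention` shown later in this README) instead of computing dense attention.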
**SCBench** analyzes long-context methods from a **KV cache-centric perspective** across the full KV cache lifecycle (i.e., KV cache generation, compression, retrieval, and loading). It evaluates 12 tasks under two shared-context modes, covering four categories of long-context capability: string retrieval, semantic retrieval, global information, and multi-task scenarios.

- [SCBench: A KV Cache-Centric Analysis of Long-Context Methods](https://arxiv.org/abs/2412.10319) (ICLR'25, ENLSP @ NeurIPS'24)<br>
  _Yucheng Li, Huiqiang Jiang, Qianhui Wu, Xufang Luo, Surin Ahn, Chengruidong Zhang, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang and Lili Qiu_

**MMInference** uses **modality-aware permutation sparse attention** to accelerate long-context VLM inference in the pre-filling stage. Specifically, we implement three distinct permutation-based sparse attention mechanisms, built on FlashAttention, FlashDecoding, and PIT, to address the grid patterns in vision inputs and the modality-boundary issues in mixed-modality scenarios.

- [MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention](https://arxiv.org/abs/2504.16083) (ICML'25, FM-Wild @ ICLR'25)<br>
  _Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang and Lili Qiu_


## 🎥 Overview

![Onepage of MInference](https://raw.githubusercontent.com/microsoft/MInference/main/images/MInference1_onepage.png)
![Onepage of SCBench](https://raw.githubusercontent.com/microsoft/MInference/main/images/SCBench_onepage.png)
![Onepage of MMInference](https://raw.githubusercontent.com/microsoft/MInference/main/images/MMInference_onepage.png)

## 🎯 Quick Start

### Requirements

- Torch
- FlashAttention-2 (optional)
- Triton
- **Transformers >= 4.46.0**

To get started with MInference, simply install it using pip:

```bash
pip install minference
```

### Supported Efficient Methods

You can get the complete list of supported efficient methods by running the following code:
```python
from minference import MInferenceConfig
supported_attn_types = MInferenceConfig.get_available_attn_types()
supported_kv_types = MInferenceConfig.get_available_kv_types()
```

Currently, we support the following long-context methods:

- **[① KV Cache Generation]:** [MInference](https://arxiv.org/abs/2407.02490), [xAttention](https://arxiv.org/abs/2503.16428), [FlexPrefill](https://arxiv.org/abs/2502.20766), [A-shape](https://arxiv.org/abs/2309.17453), [Tri-shape](https://arxiv.org/abs/2412.10319), [MInference w/ static](https://arxiv.org/abs/2407.02490), [Dilated](https://arxiv.org/abs/2004.05150), [Strided](https://arxiv.org/abs/1904.10509)
- **[② KV Cache Compression]:** [StreamingLLM](https://arxiv.org/abs/2309.17453), [SnapKV](https://arxiv.org/abs/2404.14469), [PyramidKV](https://arxiv.org/abs/2406.02069), [KIVI](https://arxiv.org/abs/2402.02750)
- **[③ KV Cache Retrieval]:** [CacheBlend](https://arxiv.org/abs/2405.16444)
- **[④ KV Cache Loading]:** [Quest](https://arxiv.org/abs/2406.10774), [RetrievalAttention](https://arxiv.org/abs/2409.10516)

For more details about the KV cache lifecycle, please refer to [**SCBench**](https://arxiv.org/abs/2412.10319). Note that only a subset of these modes is supported with vLLM, while all modes are supported with HF.
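As a quick (hypothetical) sanity check, these two lists can be combined with the `MInference` entry point shown later in this README to validate an attention/KV-type pairing before patching a model; the specific combination below is an assumption, not a guaranteed-supported pairing:

```python
from minference import MInference, MInferenceConfig

# Verify the requested attention and KV-cache types before patching.
attn_type, kv_type = "minference", "snapkv"  # assumed combination
assert attn_type in MInferenceConfig.get_available_attn_types()
assert kv_type in MInferenceConfig.get_available_kv_types()

minference_patch = MInference(
    attn_type=attn_type,
    model_name="Qwen/Qwen2.5-7B-Instruct",  # any supported model
    kv_type=kv_type,
)
```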
### Supported Models

In general, *MInference* **supports any decoder-only LLM**, including LLaMA-style models and Phi models.
We have adapted nearly all the open-source long-context LLMs available on the market.
If your model is not on the supported list, feel free to let us know in the issues, or follow [the guide](https://github.com/microsoft/MInference/blob/main/experiments) to manually generate the sparse-heads config.

You can get the complete list of supported LLMs by running:
```python
from minference import get_support_models
get_support_models()
```

Currently, we support the following LLMs:
- Qwen2.5: [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), [Qwen/Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct), [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct), [Qwen/Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M), [Qwen/Qwen2.5-14B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-1M)
- LLaMA-3.1: [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), [meta-llama/Meta-Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct)
- LLaMA-3: [gradientai/Llama-3-8B-Instruct-262k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-262k), [gradientai/Llama-3-8B-Instruct-Gradient-1048k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-1048k), [gradientai/Llama-3-8B-Instruct-Gradient-4194k](https://huggingface.co/gradientai/Llama-3-8B-Instruct-Gradient-4194k), [gradientai/Llama-3-70B-Instruct-Gradient-262k](https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-262k), [gradientai/Llama-3-70B-Instruct-Gradient-1048k](https://huggingface.co/gradientai/Llama-3-70B-Instruct-Gradient-1048k)
- GLM-4: [THUDM/glm-4-9b-chat-1m](https://huggingface.co/THUDM/glm-4-9b-chat-1m)
- Yi: [01-ai/Yi-9B-200K](https://huggingface.co/01-ai/Yi-9B-200K)
- Phi-3: [microsoft/Phi-3-mini-128k-instruct](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct)
- Qwen2: [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)

### How to use MInference

For HF:
```diff
from transformers import pipeline
+from minference import MInference

pipe = pipeline("text-generation", model=model_name, torch_dtype="auto", device_map="auto")

# Patch the model with MInference.
# If you load from a local path, pass the original HF model_name when initializing MInference.
+minference_patch = MInference("minference", model_name)
+pipe.model = minference_patch(pipe.model)

pipe(prompt, max_length=10)

# Using sparse KV methods, e.g. snapkv, quest, retr_attn, kivi
+minference_patch = MInference(attn_type="minference", model_name=model_name, kv_type="quest")
+pipe.model = minference_patch(pipe.model)

pipe(prompt, max_length=10)
```

For vLLM:
> For now, please use vllm>=0.4.1

```diff
from vllm import LLM, SamplingParams
+ from minference import MInference

llm = LLM(model_name, enforce_eager=True, max_model_len=128_000, enable_chunked_prefill=False)

# Patch the model with MInference.
# If you load from a local path, pass the original HF model_name when initializing MInference.
+minference_patch = MInference("vllm", model_name)
+llm = minference_patch(llm)

outputs = llm.generate(prompts, sampling_params)
```
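Putting the pieces together, an end-to-end run of the vLLM path might look like the following sketch; the model choice, prompt file, and sampling settings are illustrative assumptions:

```python
from vllm import LLM, SamplingParams
from minference import MInference

# End-to-end sketch of the vLLM path above; model and settings are illustrative.
model_name = "Qwen/Qwen2.5-7B-Instruct-1M"
llm = LLM(model_name, enforce_eager=True, max_model_len=128_000,
          enable_chunked_prefill=False)
llm = MInference("vllm", model_name)(llm)  # patch the engine in place

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
long_prompt = open("long_document.txt").read()  # assumed local file
outputs = llm.generate([long_prompt], sampling_params)
print(outputs[0].outputs[0].text)
```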
For vLLM with tensor parallelism (TP):

1. Copy `minference_patch_vllm_tp` and `minference_patch_vllm_executor` from `minference/patch.py` to the end of the `Worker` class in `vllm/worker/worker.py`. Make sure to indent `minference_patch_vllm_tp` correctly.
2. When calling vLLM, make sure `enable_chunked_prefill=False` is set.
3. Refer to the script at https://github.com/microsoft/MInference/blob/main/experiments/benchmarks/run_e2e_vllm_tp.sh

```diff
from vllm import LLM, SamplingParams
+ from minference import MInference

llm = LLM(model_name, enforce_eager=True, max_model_len=128_000, enable_chunked_prefill=False, tensor_parallel_size=2)

# Patch the model with MInference.
# If you load from a local path, pass the original HF model_name when initializing MInference.
+minference_patch = MInference("vllm", model_name)
+llm = minference_patch(llm)

outputs = llm.generate(prompts, sampling_params)
```

Using only the kernels:
```python
from minference import vertical_slash_sparse_attention, block_sparse_attention, streaming_forward

attn_output = vertical_slash_sparse_attention(q, k, v, vertical_topk, slash)
attn_output = block_sparse_attention(q, k, v, topk)
attn_output = streaming_forward(q, k, v, init_num, local_window_num)
```

For a local Gradio demo <a href='https://github.com/gradio-app/gradio'><img src='https://img.shields.io/github/stars/gradio-app/gradio'></a>:

```bash
git clone https://huggingface.co/spaces/microsoft/MInference
cd MInference
pip install -r requirments.txt
pip install flash_attn
python app.py
```

For more details, please refer to our [Examples](https://github.com/microsoft/MInference/tree/main/examples) and [Experiments](https://github.com/microsoft/MInference/tree/main/experiments). You can find more information about the dynamic compiler PIT in this [paper](https://dl.acm.org/doi/10.1145/3600006.3613139) and on [GitHub](https://github.com/microsoft/SparTA/tree/pit_artifact).

## SCBench

> [!Note]
> - **datasets >= 2.15.0**

### Load Data
You can download and load the **SCBench** data through Hugging Face datasets ([🤗 HF Repo](https://huggingface.co/datasets/microsoft/SCBench)):
```python
from datasets import load_dataset

datasets = ["scbench_kv", "scbench_prefix_suffix", "scbench_vt", "scbench_repoqa", "scbench_qa_eng", "scbench_qa_chn", "scbench_choice_eng", "scbench_many_shot", "scbench_summary", "scbench_mf", "scbench_summary_with_needles", "scbench_repoqa_and_kv"]

for dataset in datasets:
    data = load_dataset("microsoft/SCBench", dataset, split="test")
```

### Data Format

All data in **SCBench** are standardized to the following format:

```json
{
    "id": "Random id for each piece of data.",
    "context": "The long context required for the task, such as repo-code, long-document, and many-shot.",
    "multi_turns": [{"input": "multi-turn question.", "answer": "multi-turn reference answer."}],
}
```
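For example, a short illustrative loop over one sample in this format (the field names are exactly those shown above):

```python
from datasets import load_dataset

# Walk one SCBench sample: a shared long context plus multi-turn Q/A pairs.
data = load_dataset("microsoft/SCBench", "scbench_kv", split="test")
sample = data[0]
print("context length (chars):", len(sample["context"]))
for turn in sample["multi_turns"]:
    print("Q:", turn["input"][:80])
    print("A:", turn["answer"][:80])
```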
### Experiments

We implement the **Multi-Turn** and **Multi-Request** modes with HF and vLLM in two classes, [`GreedySearch`](https://github.com/microsoft/MInference/blob/yucheng/kvcompression/scbench/eval_utils.py#L1160) and [`GreedySearch_vllm`](https://github.com/microsoft/MInference/blob/yucheng/kvcompression/scbench/eval_utils.py#L1070). Please refer to the following scripts to run the experiments.

For all methods:
```bash
cd scbench
# Single-GPU, in Multi-Turn Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 1 multi-turn
# Multi-GPU, in Multi-Turn Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 2 multi-turn
# Multi-GPU, in Multi-Request Mode
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_all_tasks.sh meta-llama/Llama-3.1-8B-Instruct 2 scdq
```

For a single method:
```bash
cd scbench
# Single-GPU, in Multi-Turn Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 1 multi-turn vllm dense
# Multi-GPU, in Multi-Turn Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 2 multi-turn vllm dense
# Multi-GPU, in Multi-Request Mode, using attn_type: vllm, kv_type: dense
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 CUDA_VISIBLE_DEVICES=0,1 VLLM_WORKER_MULTIPROC_METHOD=spawn bash scripts/run_single_method.sh meta-llama/Llama-3.1-8B-Instruct 2 scdq vllm dense
```

For more details about **attn_type** and **kv_type**, please refer to this section: [Supported Efficient Methods](https://github.com/microsoft/MInference/tree/main?tab=readme-ov-file#supported-efficient-methods).

## FAQ

For more insights and answers, visit our [FAQ section](https://github.com/microsoft/MInference/blob/main/Transparency_FAQ.md).

**Q1: How can we effectively evaluate the impact of dynamic sparse attention on the capabilities of long-context LLMs?**

To evaluate long-context LLM capabilities using models like LLaMA-3-8B-Instruct-1M and GLM-4-9B-1M, we tested: 1) the context window with RULER, 2) general tasks with InfiniteBench, 3) retrieval tasks with Needle in a Haystack, and 4) language-model prediction with PG-19.<br/>
We found that traditional methods perform poorly in retrieval tasks, with difficulty levels as follows: <font color="#337ab7"><b>KV retrieval > Needle in a Haystack > Retrieval.Number > Retrieval PassKey</b></font>. The main challenge is the semantic difference between needles and the haystack. Traditional methods excel when this difference is larger, as in passkey tasks. KV retrieval requires higher retrieval capabilities since any key can be a target, and multi-needle tasks are even more complex.<br/>
We will continue to update our results with more models and datasets in future versions.

**Q2: Does this dynamic sparse attention pattern only exist in long-context LLMs that are not fully trained?**

Firstly, attention is dynamically sparse, a characteristic inherent to the mechanism. We selected state-of-the-art long-context LLMs, GLM-4-9B-1M and LLaMA-3-8B-Instruct-1M, with effective context windows of 64K and 16K. With MInference, these can be extended to 64K and 32K, respectively.
We will continue to adapt our method to other advanced long-context LLMs and update our results, as well as explore the theoretical basis for this dynamic sparse attention pattern.

**Q3: Does this dynamic sparse attention pattern only exist in auto-regressive LMs or RoPE-based LLMs?**

Similar vertical and slash-line sparse patterns have been discovered in BERT [1] and multi-modal LLMs [2]. Our analysis of T5's attention patterns, shown in the figure below, reveals that these patterns persist across different heads, even with bidirectional attention.<br/>
[1] SparseBERT: Rethinking the Importance Analysis in Self-Attention, ICML 2021.<br/>
[2] LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference, 2024.<br/>
<p align="center">
    <img src="https://raw.githubusercontent.com/microsoft/MInference/main/images/t5_sparse_pattern.png" width="600px" style="margin:auto;border-radius: 5px;display: inline-block;padding: 0 0 0 10px;" alt=''>
</p>
<p align="center">Figure 1. The sparse pattern in the T5 encoder.</p>

**Q4: What is the relationship between MInference, SSM, Linear Attention, and Sparse Attention?**

All four approaches (MInference, SSM, Linear Attention, and Sparse Attention) efficiently optimize attention complexity in Transformers, each introducing inductive bias differently. The latter three require training from scratch. Recent works like Mamba-2 and Unified Implicit Attention Representation unify SSM and Linear Attention as static sparse attention, with Mamba-2 itself being a block-wise sparse method. While these approaches show potential thanks to the sparse redundancy in attention, static sparse attention may struggle with the dynamic semantic associations in complex tasks. In contrast, dynamic sparse attention is better suited to managing these relationships.

**Q5**: CUDA Out of Memory in `_prepare_4d_causal_attention_mask_with_cache_position`

_Solution_: Set the Hugging Face model's attention backend to FlashAttention-2 by adding the following argument during model initialization: `_attn_implementation="flash_attention_2"`.

**Q6**: CUDA Out of Memory in `logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :])`

_Solution_: Set `num_logits_to_keep=1` in the model's forward call.
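A combined sketch of the two fixes above, assuming a recent transformers version whose forward accepts `num_logits_to_keep`; the model name and prompt are illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    _attn_implementation="flash_attention_2",  # Q5: avoid materializing the full 4D causal mask
)

input_ids = tokenizer("A very long prompt ...", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    out = model(input_ids, num_logits_to_keep=1)  # Q6: keep logits for the last position only
```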
## Citation

If you find MInference useful or relevant to your project and research, please kindly cite our papers:

```bibtex
@inproceedings{jiang2024minference,
  author = {Huiqiang Jiang and Yucheng Li and Chengruidong Zhang and Qianhui Wu and Xufang Luo and Surin Ahn and Zhenhua Han and Amir H. Abdi and Dongsheng Li and Chin-Yew Lin and Yuqing Yang and Lili Qiu},
  booktitle = {The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  title = {{MI}nference 1.0: Accelerating Pre-filling for Long-Context {LLM}s via Dynamic Sparse Attention},
  url = {https://openreview.net/forum?id=fPBACAbqSN},
  year = {2024}
}

@inproceedings{li2025scbench,
  title={{SCB}ench: A {KV} Cache-Centric Analysis of Long-Context Methods},
  author={Yucheng Li and Huiqiang Jiang and Qianhui Wu and Xufang Luo and Surin Ahn and Chengruidong Zhang and Amir H. Abdi and Dongsheng Li and Jianfeng Gao and Yuqing Yang and Lili Qiu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=gkUyYcY1W9}
}

@inproceedings{li2025mminference,
  title={{MMI}nference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention},
  author={Li, Yucheng and Jiang, Huiqiang and Zhang, Chengruidong and Wu, Qianhui and Luo, Xufang and Ahn, Surin and Abdi, Amir H and Li, Dongsheng and Gao, Jianfeng and Yang, Yuqing and Qiu, Lili},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=me6PfbATWM}
}
```

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.