{"id":13754067,"url":"https://github.com/hao-ai-lab/LookaheadDecoding","last_synced_at":"2025-05-09T22:30:51.951Z","repository":{"id":208496694,"uuid":"721748445","full_name":"hao-ai-lab/LookaheadDecoding","owner":"hao-ai-lab","description":"[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding","archived":false,"fork":false,"pushed_at":"2025-03-06T07:17:19.000Z","size":35092,"stargazers_count":1235,"open_issues_count":33,"forks_count":75,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-04-11T22:59:21.563Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2402.02057","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hao-ai-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-21T17:36:43.000Z","updated_at":"2025-04-09T22:30:08.000Z","dependencies_parsed_at":"2025-01-16T18:05:49.588Z","dependency_job_id":"afa183cd-377c-41d7-9f0a-d92515d4f972","html_url":"https://github.com/hao-ai-lab/LookaheadDecoding","commit_stats":{"total_commits":22,"total_committers":5,"mean_commits":4.4,"dds":0.2272727272727273,"last_synced_commit":"08fb2d8feec1798ec343204c20dde2b782ee8951"},"previous_names":["hao-ai-lab/lookaheaddecoding"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hao-ai-lab%2FLookaheadDecoding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hao-ai-lab%2FLookaheadDecoding/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/hao-ai-lab%2FLookaheadDecoding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hao-ai-lab%2FLookaheadDecoding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hao-ai-lab","download_url":"https://codeload.github.com/hao-ai-lab/LookaheadDecoding/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335292,"owners_count":21892643,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:38.401Z","updated_at":"2025-05-09T22:30:51.945Z","avatar_url":"https://github.com/hao-ai-lab.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003cdiv align=\"center\"\u003e\u003ch1\u003e\u0026nbsp;Break the Sequential Dependency of LLM Inference Using Lookahead Decoding\u003c/h1\u003e\u003c/div\u003e\r\n\r\n\u003cp align=\"center\"\u003e\r\n| \u003ca href=\"https://arxiv.org/abs/2402.02057\"\u003e\u003cb\u003ePaper\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://lmsys.org/blog/2023-11-21-lookahead-decoding/\"\u003e\u003cb\u003eBlog\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://github.com/hao-ai-lab/LookaheadDecoding/issues/13\"\u003e\u003cb\u003eRoadmap\u003c/b\u003e\u003c/a\u003e | \r\n\u003c/p\u003e\r\n\r\n---\r\n*News* 🔥\r\n- [2024/2] Lookahead Decoding Paper now available on [arXiv](https://arxiv.org/abs/2402.02057). 
[Sampling](#use-lookahead-decoding-in-your-own-code) and [FlashAttention](#flashAttention-support) are now supported. Advanced features for better token prediction have been added.\r\n\r\n---\r\n## Introduction \r\nWe introduce lookahead decoding:\r\n- A parallel decoding algorithm to accelerate LLM inference.\r\n- Without the need for a draft model or a data store.\r\n- Linearly decreases the number of decoding steps relative to log(FLOPs) used per decoding step.\r\n\r\nBelow is a demo of lookahead decoding accelerating LLaMA-2-Chat 7B generation:\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003cpicture\u003e\r\n  \u003cimg src=\"media/acc-demo.gif\" width=\"80%\"\u003e\r\n  \u003c/picture\u003e\r\n  \u003cbr\u003e\r\n  \u003cdiv align=\"center\" width=\"80%\"\u003e\r\n  \u003cem\u003eDemo of speedups by lookahead decoding on LLaMA-2-Chat 7B generation. Tokens shown in blue are generated in parallel within a single decoding step.\u003c/em\u003e\r\n  \u003c/div\u003e\r\n  \u003cbr\u003e\r\n\u003c/div\u003e\r\n\r\n### Background: Parallel LLM Decoding Using Jacobi Iteration\r\n\r\nLookahead decoding is motivated by [Jacobi decoding](https://arxiv.org/pdf/2305.10427.pdf), which views autoregressive decoding as solving a nonlinear system and decodes all future tokens simultaneously using a fixed-point iteration method. 
Below is a Jacobi decoding example.\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003cpicture\u003e\r\n  \u003cimg src=\"media/jacobi-iteration.gif\" width=\"80%\"\u003e\r\n  \u003c/picture\u003e\r\n  \u003cbr\u003e\r\n  \u003cdiv align=\"center\" width=\"80%\"\u003e\r\n  \u003cem\u003eIllustration of applying the Jacobi iteration method to parallel LLM decoding.\u003c/em\u003e\r\n  \u003c/div\u003e\r\n  \u003cbr\u003e\r\n\u003c/div\u003e\r\n\r\nHowever, Jacobi decoding rarely yields a wall-clock speedup in real-world LLM applications.\r\n\r\n### Lookahead Decoding: Make Jacobi Decoding Feasible\r\n\r\nLookahead decoding builds on Jacobi decoding by collecting and caching the n-grams generated along Jacobi iteration trajectories.\r\n\r\nThe following gif shows the process of collecting 2-grams via Jacobi decoding and verifying them to accelerate decoding.\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003cpicture\u003e\r\n  \u003cimg src=\"media/lookahead-decoding.gif\" width=\"80%\"\u003e\r\n  \u003c/picture\u003e\r\n  \u003cbr\u003e\r\n  \u003cdiv align=\"center\" width=\"80%\"\u003e\r\n  \u003cem\u003eIllustration of lookahead decoding with 2-grams.\u003c/em\u003e\r\n  \u003c/div\u003e\r\n  \u003cbr\u003e\r\n\u003c/div\u003e\r\n\r\nTo make this process more efficient, each lookahead decoding step is divided into two parallel branches: the lookahead branch and the verification branch. The lookahead branch maintains a fixed-size 2D window to generate n-grams from the Jacobi iteration trajectory. Simultaneously, the verification branch selects and verifies promising n-gram candidates.\r\n\r\n### Lookahead Branch and Verification Branch\r\n\r\nThe lookahead branch aims to generate new n-grams. 
The branch operates with a two-dimensional window defined by two parameters:\r\n- Window size W: How far ahead we look in future token positions to conduct parallel decoding.\r\n- N-gram size N: How many steps we look back into the past Jacobi iteration trajectory to retrieve n-grams.\r\n\r\nIn the verification branch, we identify n-grams whose first token matches the last input token. This is determined via a simple string match. Once identified, these n-grams are appended to the current input and verified with a single LLM forward pass over them.\r\n\r\nWe implement both branches in one attention mask to make fuller use of the GPU's parallel computing power.\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003cpicture\u003e\r\n  \u003cimg src=\"media/mask.png\" width=\"40%\"\u003e\r\n  \u003c/picture\u003e\r\n  \u003cbr\u003e\r\n  \u003cdiv align=\"center\" width=\"80%\"\u003e\r\n  \u003cem\u003eAttention mask for lookahead decoding with 4-grams and window size 5. In this mask, two 4-gram candidates (bottom right) are verified concurrently with parallel decoding.\u003c/em\u003e\r\n  \u003c/div\u003e\r\n  \u003cbr\u003e\r\n\u003c/div\u003e\r\n\r\n### Experimental Results\r\n\r\nOur study shows that lookahead decoding substantially reduces latency, with speedups ranging from 1.5x to 2.3x across different datasets on a single GPU. 
See the figure below.\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n  \u003cpicture\u003e\r\n  \u003cimg src=\"media/lookahead-perf.png\" width=\"80%\"\u003e\r\n  \u003c/picture\u003e\r\n  \u003cbr\u003e\r\n  \u003cdiv align=\"center\" width=\"80%\"\u003e\r\n  \u003cem\u003eSpeedup of lookahead decoding on different models and datasets.\u003c/em\u003e\r\n  \u003c/div\u003e\r\n  \u003cbr\u003e\r\n\u003c/div\u003e\r\n\r\n## Contents\r\n- [Introduction](#introduction)\r\n- [Contents](#contents)\r\n- [Installation](#installation)\r\n  - [Install With Pip](#install-with-pip)\r\n  - [Install From The Source](#install-from-the-source)\r\n  - [Inference](#inference-with-lookahead-decoding)\r\n  - [Use In Your Own Code](#use-lookahead-decoding-in-your-own-code)\r\n- [Citation](#citation)\r\n- [Guidance](#guidance)\r\n\r\n\r\n## Installation\r\n### Install with pip\r\n```bash\r\npip install lade\r\n```\r\n### Install from the source\r\n```bash\r\ngit clone https://github.com/hao-ai-lab/LookaheadDecoding.git\r\ncd LookaheadDecoding\r\npip install -r requirements.txt\r\npip install -e .\r\n```\r\n\r\n### Inference With Lookahead decoding\r\nYou can run the minimal example to see the speedup that Lookahead decoding brings.\r\n```bash\r\npython minimal.py #no Lookahead decoding\r\nUSE_LADE=1 LOAD_LADE=1 python minimal.py #use Lookahead decoding, 1.6x speedup\r\n```\r\n\r\nYou can also enjoy chatting with your own chatbots with Lookahead decoding.\r\n```bash\r\nUSE_LADE=1 python applications/chatbot.py  --model_path meta-llama/Llama-2-7b-chat-hf --debug --chat #chat, with lookahead \r\nUSE_LADE=0 python applications/chatbot.py  --model_path meta-llama/Llama-2-7b-chat-hf --debug --chat #chat, without lookahead\r\n\r\n\r\nUSE_LADE=1 python applications/chatbot.py  --model_path meta-llama/Llama-2-7b-chat-hf --debug #no chat, with lookahead\r\nUSE_LADE=0 python applications/chatbot.py  --model_path meta-llama/Llama-2-7b-chat-hf --debug #no chat, without lookahead\r\n```\r\n\r\n### Use 
Lookahead decoding in your own code\r\nYou can import and use Lookahead decoding in your own code in three lines. You also need to set ```USE_LADE=1``` on the command line or set ```os.environ[\"USE_LADE\"]=\"1\"``` in your Python script. Note that Lookahead decoding currently supports only LLaMA models.\r\n\r\n```python\r\nimport lade\r\nlade.augment_all()\r\nlade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0) \r\n#LEVEL, WINDOW_SIZE and GUESS_SET_SIZE are three important configurations (N,W,G) in lookahead decoding; please refer to our blog for details.\r\n#You can obtain better performance by tuning LEVEL/WINDOW_SIZE/GUESS_SET_SIZE on your own device.\r\n```\r\n\r\nThen you can speed up the decoding process. Here is an example using greedy search:\r\n```python\r\nimport torch\r\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\r\n#model_name, input_text and torch_device are placeholders that you define yourself\r\ntokenizer = AutoTokenizer.from_pretrained(model_name)\r\nmodel = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=torch_device)\r\nmodel_inputs = tokenizer(input_text, return_tensors='pt').to(torch_device)\r\ngreedy_output = model.generate(**model_inputs, max_new_tokens=1024) #speedup obtained\r\n```\r\n\r\nHere is an example using sampling:\r\n```python\r\ntokenizer = AutoTokenizer.from_pretrained(model_name)\r\nmodel = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=torch_device)\r\nmodel_inputs = tokenizer(input_text, return_tensors='pt').to(torch_device)\r\nsample_output = model.generate(**model_inputs, max_new_tokens=1024, temperature=0.7) #speedup obtained\r\n```\r\n\r\n### FlashAttention Support\r\nInstall the original FlashAttention:\r\n```bash\r\npip install flash-attn==2.3.3 #original FlashAttention\r\n```\r\nThere are two ways to install the FlashAttention build specialized for Lookahead Decoding:\r\n1) Download a pre-built package from https://github.com/Viol2000/flash-attention-lookahead/releases/tag/v2.3.3 and install it (fast, recommended).\r\nFor example, with cuda==11.8, python==3.9 and torch==2.1, run the following: \r\n```bash\r\nwget 
https://github.com/Viol2000/flash-attention-lookahead/releases/download/v2.3.3/flash_attn_lade-2.3.3+cu118torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl\r\npip install flash_attn_lade-2.3.3+cu118torch2.1cxx11abiFALSE-cp39-cp39-linux_x86_64.whl\r\n```\r\n2) Install from the source (slow, not recommended)\r\n```bash\r\ngit clone https://github.com/Viol2000/flash-attention-lookahead.git\r\ncd flash-attention-lookahead \u0026\u0026 python setup.py install\r\n```\r\n\r\nHere is an example script to run the models with FlashAttention: \r\n```bash\r\npython minimal-flash.py #no Lookahead decoding, w/ FlashAttention\r\nUSE_LADE=1 LOAD_LADE=1 python minimal-flash.py #use Lookahead decoding, w/ FlashAttention, 20% faster than w/o FlashAttention\r\n```\r\n\r\nIn your own code, you need to set ```USE_FLASH=True``` when calling ```config_lade```, and set ```attn_implementation=\"flash_attention_2\"``` when calling ```AutoModelForCausalLM.from_pretrained```.\r\n```python\r\nimport lade\r\nlade.augment_all()\r\nlade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, USE_FLASH=True, DEBUG=0) \r\ntokenizer = AutoTokenizer.from_pretrained(model_name)\r\nmodel = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map=torch_device, attn_implementation=\"flash_attention_2\")\r\nmodel_inputs = tokenizer(input_text, return_tensors='pt').to(torch_device)\r\ngreedy_output = model.generate(**model_inputs, max_new_tokens=1024) #speedup obtained\r\n```\r\nWe will integrate FlashAttention directly into this repo for simpler installation and usage.\r\n\r\n## Citation\r\n```bibtex\r\n@article{fu2024break,\r\n  title={Break the sequential dependency of llm inference using lookahead decoding},\r\n  author={Fu, Yichao and Bailis, Peter and Stoica, Ion and Zhang, Hao},\r\n  journal={arXiv preprint arXiv:2402.02057},\r\n  year={2024}\r\n}\r\n```\r\n## Guidance\r\nThe core implementation is in decoding.py. 
Lookahead decoding requires an adaptation for each specific model. An example is in models/llama.py.\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhao-ai-lab%2FLookaheadDecoding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhao-ai-lab%2FLookaheadDecoding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhao-ai-lab%2FLookaheadDecoding/lists"}