{"id":20298128,"url":"https://github.com/FasterDecoding/REST","last_synced_at":"2025-05-07T20:34:27.181Z","repository":{"id":207294202,"uuid":"718875034","full_name":"FasterDecoding/REST","owner":"FasterDecoding","description":"REST: Retrieval-Based Speculative Decoding, NAACL 2024","archived":false,"fork":false,"pushed_at":"2024-12-02T02:39:17.000Z","size":1108,"stargazers_count":199,"open_issues_count":10,"forks_count":12,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-12T10:59:37.261Z","etag":null,"topics":["llm-inference","retrieval","speculative-decoding"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FasterDecoding.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-15T00:59:25.000Z","updated_at":"2025-04-12T07:48:03.000Z","dependencies_parsed_at":"2023-11-17T03:25:22.088Z","dependency_job_id":"428ed93c-3eed-4bda-9b73-5640d01e7f08","html_url":"https://github.com/FasterDecoding/REST","commit_stats":null,"previous_names":["fasterdecoding/rest"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FREST","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FREST/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FREST/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FasterDecoding%2FREST/manifests","owner_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/owners/FasterDecoding","download_url":"https://codeload.github.com/FasterDecoding/REST/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252953717,"owners_count":21830890,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm-inference","retrieval","speculative-decoding"],"created_at":"2024-11-14T16:02:14.002Z","updated_at":"2025-05-07T20:34:27.153Z","avatar_url":"https://github.com/FasterDecoding.png","language":"C","readme":"# REST: Retrieval-Based Speculative Decoding\n\n***If training's got you in a stew, take a REST and speed right through.***\n\n[[Paper](https://arxiv.org/abs/2311.08252)] [[Blog](https://sites.google.com/view/rest-llm/)]\n\n## News\n🎉 2024-3-14: REST is accepted to **NAACL 2024**!\n\n## Introduction\n\nREST is a retrieval-based speculative decoding method designed to boost the generation speed of LLMs. Instead of relying on a draft language model as in standard speculative decoding, REST retrieves draft tokens from a datastore. Moreover, REST differs from blockwise parallel decoding and Medusa in that it requires no extra training steps: it is a plug-and-play solution capable of **accelerating any pre-existing language model**.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cpicture\u003e\n  \u003cimg src=\"assets/rest_overview.png\" width=\"85%\"\u003e\n  \u003c/picture\u003e\n  \u003cbr\u003e\n  \u003cdiv align=\"left\" width=\"80%\"\u003e\n  \u003cem\u003eOverview of REST. 
During inference, the input context serves as the query to retrieve docs from the datastore that match the longest suffix of the input. A Trie is constructed from the continuations of the retrieved docs, and low-frequency branches are pruned. Candidates from the pruned subtree are then fed into the LLM with a tree attention mask for verification. All draft tokens that are correct from the start are accepted, and the draft tokens after the first mistake are rejected.\u003c/em\u003e\n  \u003c/div\u003e\n  \u003cbr\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cpicture\u003e\n  \u003cimg src=\"assets/rest_results.png\" width=\"85%\"\u003e\n  \u003c/picture\u003e\n  \u003cbr\u003e\n  \u003cdiv align=\"left\" width=\"40%\"\u003e\n  \u003cem\u003eSpeed on HumanEval and MT-Bench with standard autoregressive generation and REST. The temperature is set to 0.8 and the top-p to 0.95 for nucleus sampling in HumanEval. For MT-Bench, the settings are 0.7 for temperature and 0.8 for top-p. 
All the experiments are conducted on a single NVIDIA A6000 GPU and 96 CPU cores with a batch size of 1.\u003c/em\u003e\n  \u003c/div\u003e\n  \u003cbr\u003e\n\u003c/div\u003e\n\n## Contents\n- [Introduction](#introduction)\n- [Contents](#contents)\n- [Installation](#installation)\n- [Build datastore](#build-datastore)\n  - [Build a small one](#build-a-small-one)\n  - [Build a large one](#build-a-large-one)\n- [Inference](#inference)\n  - [Inference on MT-Bench](#inference-on-mt-bench)\n  - [Inference on HumanEval](#inference-on-humaneval)\n  - [Free Chat](#free-chat)\n- [Other Models and Datastore](#other-models-and-datastore)\n- [Citation](#citation)\n- [Acknowledgements](#acknowledgements)\n\n## Installation\n```bash\nconda create -n rest python=3.9\nconda activate rest\npip3 install -r requirements.txt # pay attention to the PyTorch CUDA version\npip3 install DraftRetriever/wheels/draftretriever-0.1.0-cp39-cp39-manylinux_2_34_x86_64.whl\n```\n\n## Build datastore\n\n### Build a small one\nBuild a chat datastore using data from [ShareGPT](https://huggingface.co/datasets/Aeala/ShareGPT_Vicuna_unfiltered) within 10 minutes (requires 465MB disk storage)\n```bash\ncd datastore\npython3 get_datastore_chat.py --model-path lmsys/vicuna-7b-v1.5 # get datastore_chat_small.idx in this folder\n```\nBuild a Python code generation datastore from [The Stack](https://huggingface.co/datasets/bigcode/the-stack) within 20 minutes (requires 924MB disk storage)\n```bash\ncd datastore\npython3 get_datastore_code.py --model-path codellama/CodeLlama-7b-instruct-hf # get datastore_stack_small.idx in this folder\n```\n\n### Build a large one\n(Optional) Build a chat datastore using data from [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) (requires 12GB disk storage)\n```bash\ncd 
datastore\npython3 get_datastore_chat.py --model-path lmsys/vicuna-7b-v1.5 --large-datastore True # get datastore_chat_large.idx in this folder\n```\n(Optional) Build a Python code generation datastore from [The Stack](https://huggingface.co/datasets/bigcode/the-stack) (requires 27GB disk storage)\n```bash\ncd datastore\npython3 get_datastore_code.py --model-path codellama/CodeLlama-7b-instruct-hf --large-datastore True # get datastore_stack_large.idx in this folder\n```\n\n## Inference\n\n### Inference on MT-Bench\n```bash\ncd llm_judge\nRAYON_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=0 python3 gen_model_answer_rest.py --model-path lmsys/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5 --datastore-path ../datastore/datastore_chat_small.idx\n```\n\n### Inference on HumanEval\n```bash\ncd human_eval\nRAYON_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=0 python3 rest_test.py --model-path codellama/CodeLlama-7b-instruct-hf --datastore-path ../datastore/datastore_stack_small.idx\n```\n\n### Free Chat\n```bash\nRAYON_NUM_THREADS=6 CUDA_VISIBLE_DEVICES=0 python3 -m rest.inference.cli --datastore-path datastore/datastore_chat_small.idx --base-model lmsys/vicuna-7b-v1.5\n```\n\nNote that the `RAYON_NUM_THREADS` environment variable controls the maximum number of threads used for retrieval; you can adjust it to fit your machine.\n\n## Other Models and Datastore\nThe examples above use Vicuna and CodeLlama by default, but you can use any LLaMA-based model by simply changing the `--model-path` argument. You can also build the datastore from any data you like. 
If you want to use architectures other than LLaMA, modify model/modeling_llama_kv.py to match the corresponding model.\n\nNote: For models with a vocab size larger than 65535 (the range of u16), change [this line in the writer](https://github.com/FasterDecoding/REST/blob/main/DraftRetriever/src/lib.rs#L117) from `self.index_file.write_u16::\u003cLittleEndian\u003e(item as u16)?;` to `self.index_file.write_u32::\u003cLittleEndian\u003e(item as u32)?;`.\nIn addition, change [these two lines in Reader](https://github.com/FasterDecoding/REST/blob/main/DraftRetriever/src/lib.rs#L191-L192) from `for i in (0..data_u8.len()).step_by(2) { let int = LittleEndian::read_u16(\u0026data_u8[i..i+2]) as i32;` to `for i in (0..data_u8.len()).step_by(4) { let int = LittleEndian::read_u32(\u0026data_u8[i..i+4]) as i32;` (fixed by [scandukuri](https://github.com/FasterDecoding/REST/pull/23)).\n\n## Citation\n```\n@misc{he2023rest,\n      title={REST: Retrieval-Based Speculative Decoding},\n      author={Zhenyu He and Zexuan Zhong and Tianle Cai and Jason D Lee and Di He},\n      year={2023},\n      eprint={2311.08252},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n## Acknowledgements\nThe codebase builds on [Medusa](https://github.com/FasterDecoding/Medusa) and is influenced by remarkable projects from the LLM community, including [FastChat](https://github.com/lm-sys/FastChat), [TinyChat](https://github.com/mit-han-lab/llm-awq/tree/main/), [vllm](https://github.com/vllm-project/vllm), and many others.\n\n","funding_links":[],"categories":["A01_文本生成_文本对话","C"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFasterDecoding%2FREST","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFasterDecoding%2FREST","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFasterDecoding%2FREST/lists"}