{"id":15829107,"url":"https://github.com/Infini-AI-Lab/Sequoia","last_synced_at":"2025-10-16T21:31:31.310Z","repository":{"id":225215944,"uuid":"765005704","full_name":"Infini-AI-Lab/Sequoia","owner":"Infini-AI-Lab","description":"scalable and robust tree-based speculative decoding algorithm","archived":false,"fork":false,"pushed_at":"2025-01-28T07:05:24.000Z","size":4824,"stargazers_count":331,"open_issues_count":10,"forks_count":37,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-01-28T08:19:53.432Z","etag":null,"topics":["efficiency","inference","llm","speculative-decoding"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Infini-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-29T05:23:18.000Z","updated_at":"2025-01-28T07:05:28.000Z","dependencies_parsed_at":"2024-04-28T04:23:19.271Z","dependency_job_id":"efdb6cd4-fa13-4e33-8e47-c0ae6cd5ffc3","html_url":"https://github.com/Infini-AI-Lab/Sequoia","commit_stats":null,"previous_names":["infini-ai-lab/sequoia"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Infini-AI-Lab%2FSequoia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Infini-AI-Lab%2FSequoia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Infini-AI-Lab%2FSequoia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Infini-AI-Lab%2FSequoia/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Infini-AI-Lab","download_url":"https://codeload.github.com/Infini-AI-Lab/Sequoia/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236749064,"owners_count":19198617,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["efficiency","inference","llm","speculative-decoding"],"created_at":"2024-10-05T11:00:37.224Z","updated_at":"2025-10-16T21:31:23.520Z","avatar_url":"https://github.com/Infini-AI-Lab.png","language":"Python","readme":"# Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding\n\nCheck our refactored repo [[UMbreLLa](https://github.com/Infini-AI-Lab/UMbreLLa)] for\n- [√] Up-to-date models (Llama3, Qwen, Deepseek).\n- [√] AWQ support.\n- [√] Gradio, API, and CLI chatbots.\n\n[[paper](https://arxiv.org/abs/2402.12374)]\n## Environment Setup\nWe recommend the following commands to set up the environment:\n\n    pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121\n    pip install transformers==4.36.2\n    pip install accelerate==0.26.1\n    pip install datasets==2.16.1\n    pip install einops\n    pip install protobuf\n    pip install sentencepiece\n    pip install typing-extensions\n\n## Evaluations\nTo reproduce the main results:\n\n    cd tests\n    bash run_L40.sh\n\nor `bash run_A100.sh`\n\nA command should be in the following format:\n\n    python testbed.py --model  JackFram/llama-68m   --target meta-llama/Llama-2-7b-hf  \\\n    --T 0.6 --P 1.0  --start 0 --end 200 --M 384 \\\n    
--growmap ../A100_growmaps/68m_7b/growmaps/A100-CNN-68m-7b-stochastic.pt \\\n    --Mode greedy --dataset cnn\n\n`testbed.py` is for stochastic decoding. `testbed_greedy.py` is for greedy decoding. `test_specinfer.py` is for SpecInfer sampling. `test_greedyS.py` is for Top-k/greedy sampling. `test_accept.py` is for preparing the acceptance rate vector.\n\n`--model` specifies the draft model and `--target` specifies the target model. Currently, only Llama models are supported (including Llama2, Sheared-LLaMA, Vicuna, and TinyLlama).\n\n`--T` specifies the temperature and `--P` specifies the top-p for generation.\n\n`--dataset` should be one of `cnn`, `openwebtext`, or `c4`. `--start` and `--end` decide how many examples will be evaluated. `--seed` adjusts the random seed; to precisely reproduce the results, the seed is set to 17 by default.\n\n`--growmap` specifies the tree structure. We have prepared some growmaps in `A100_growmaps` and `L40_growmaps`.\n\n`--M` should be set to at least `#tree + 256`; 384 is enough for all the experiments except offloading. To run with offloading, use a command like the following:\n\n    CUDA_VISIBLE_DEVICES=0 python testbed.py --model meta-llama/Llama-2-7b-hf \\\n    --target meta-llama/Llama-2-70b-hf  --T 0.6 --P 1.0 \\\n    --start 0 --end 100 --Mode greedy  --M 1024 \\\n    --growmap  ../L40_growmaps/L40-CNN-7b-70b-stochastic.pt  --offloading --dataset cnn\n\nAll experiments in `tests` use a maximum sequence length of 256. To change this, **max_target_seq** should be passed to SpecTree. Again, `--M` should be set to at least `#tree + max_target_seq`.\n\n## How to obtain the acceptance rate vector\nTo obtain the acceptance rate vector, which is used in `tree_search.py`, use the following command:\n\n    python test_accept.py --model  JackFram/llama-68m   --target meta-llama/Llama-2-7b-hf  \\\n    --T 0.6 --P 1.0  --start 0 --end 200 --M 288 --W 32 \\\n    --ALG stochastic --dataset cnn\n\n`--ALG` is `stochastic` or `greedy`. `--W` is the maximum width. 
`--M` should be set to at least `--W + 256`.\n\nTo statically obtain the acceptance rate vector (which is much faster if the target model needs offloading), use:\n\n    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python fast_test.py --model meta-llama/Llama-2-7b-hf  \\\n    --target meta-llama/Llama-2-70b-hf --T 1.1 --P 1.0 --DP 1.1 --W 32 --start 0 --end 200\n\nThe acceptance rate vector will be printed and saved to `--dst` (`../acceptance-rate-vector.pt` by default).\n\n## How to generate growmaps\n\nUse the following command:\n\n    python tree_search.py --config demo-config.json\n\nModify the content of `demo-config.json` to generate different growmaps. The growmaps for the experiments in the paper are prepared in `L40_growmaps` and `A100_growmaps`.\n\n## TODOs\n- [ ] Support other open-source models.\n- [ ] Support multi-round dialogue.\n- [ ] Support INT4/8 quantization.\n- [ ] Support multi-GPU.\n\n## Citation\n\nIf you find Sequoia useful or relevant to your project and research, please kindly cite our paper:\n\n```bibtex\n@article{chen2024sequoia,\n  title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},\n  author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},\n  journal={arXiv preprint arXiv:2402.12374},\n  year={2024}\n}\n```\n","funding_links":[],"categories":["Python","A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FInfini-AI-Lab%2FSequoia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FInfini-AI-Lab%2FSequoia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FInfini-AI-Lab%2FSequoia/lists"}