Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Infini-AI-Lab/Sequoia
scalable and robust tree-based speculative decoding algorithm
https://github.com/Infini-AI-Lab/Sequoia
efficiency inference llm speculative-decoding
Last synced: 19 days ago
JSON representation
scalable and robust tree-based speculative decoding algorithm
- Host: GitHub
- URL: https://github.com/Infini-AI-Lab/Sequoia
- Owner: Infini-AI-Lab
- Created: 2024-02-29T05:23:18.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-01-28T07:05:24.000Z (24 days ago)
- Last Synced: 2025-01-28T08:19:53.432Z (24 days ago)
- Topics: efficiency, inference, llm, speculative-decoding
- Language: Python
- Homepage:
- Size: 4.6 MB
- Stars: 331
- Watchers: 5
- Forks: 37
- Open Issues: 10
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - Infini-AI-Lab/Sequoia
README
# Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Check our refactored repo [[UMbreLLa](https://github.com/Infini-AI-Lab/UMbreLLa)] for
- [√] Up-to-date models (Llama3, Qwen, Deepseek).
- [√] AWQ support.
- [√] Gradio, API, and CLI chatbots.[[paper](https://arxiv.org/abs/2402.12374)]
## Environment Set Up
We recommend the following commands to set up the environmentpip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.36.2
pip install accelerate==0.26.1
pip install datasets==2.16.1
pip install einops
pip install protobuf
pip install sentencepiece
pip install typing-extensions## Evaluations
To reproduce the main resultscd tests
bash run_L40.shor `bash run_A100.sh`
A command should be in the format likepython testbed.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf \
--T 0.6 --P 1.0 --start 0 --end 200 --M 384 \
--growmap ../A100_growmaps/68m_7b/growmaps/A100-CNN-68m-7b-stochastic.pt \
--Mode greedy --dataset cnn`testbed.py` is for stochastic decoding. `testbed_greedy.py` is for greedy decoding. `test_specinfer.py` is for specinfer sampling. `test_greedyS.py` is for Top-k/greedy sampling. `test_accept.py` is for preparing the accepting rate vector.
`--model` specifies the draft and `--target` specifies the target. Currently, only Llama models are supported (including Llama2, Sheared-LLaMA, Vicuna and TinyLlama).
`--T` specifies the temperature and `--P` specifies the top-p for generation.
`--dataset` should be in `cnn, openwebtext, c4`. `--start` and `--end` decides how many examples will be evaluated. `--seed` is for adjusting random seeds. To precisely reproduce the results, seed is set to be 17 by default.
`--growmap` specifies the tree structure. We have prepared some growmaps in `A100_growmaps` and `L40_growmaps`.
`--M` should be set at least `#tree + 256`. 384 is enough for all the experiments except offloading. To run offloading, we need the command like the following
CUDA_VISIBLE_DEVICES=0 python testbed.py --model meta-llama/Llama-2-7b-hf \
--target meta-llama/Llama-2-70b-hf --T 0.6 --P 1.0 \
--start 0 --end 100 --Mode greedy --M 1024 \
--growmap ../L40_growmaps/L40-CNN-7b-70b-stochastic.pt --offloading --dataset cnnAll experiments in test have the max sequence length of 256. To change this, **max_target_seq** should be passed to SpecTree. Again, `--M` should be set at least `#tree + max_target_seq`.
## How to obtain acceptance rate vector
To obtain the acceptance rate vector, which is used in `tree_search.py`, we need the following commandpython test_accept.py --model JackFram/llama-68m --target meta-llama/Llama-2-7b-hf \
--T 0.6 --P 1.0 --start 0 --end 200 --M 288 --W 32\
--ALG stochastic --dataset cnn \`--ALG` is stochastic or greedy. `--W` is the maximum width. `--M` should be set at least `--W + 256`.
To statically obtain the acceptance rate vector (which is much faster if the target model needs offloading)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python fast_test.py --model meta-llama/Llama-2-7b-hf \
--target meta-llama/Llama-2-70b-hf --T 1.1 --P 1.0 --DP 1.1 --W 32 --start 0 --end 200The acceptance rate vector will be printed and will be saved to `--dst` (`../acceptance-rate-vector.pt` by default).
## How to generate growmaps
We use the following command
python tree_search.py --config demo-config.json
We can modify the content of demo-config.json to generate different growmaps. The growmaps for experiments in the paper in prepared in `L40_growmaps` and `A100_growmaps`.
## TODOs
- [ ] Support other open source models.
- [ ] Support multi-round dialogue.
- [ ] Support INT4/8 quantization.
- [ ] Support multi-GPU.
## CitationIf you find Sequoia useful or relevant to your project and research, please kindly cite our paper:
```bibtex
@article{chen2024sequoia,
title={Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding},
author={Chen, Zhuoming and May, Avner and Svirschevski, Ruslan and Huang, Yuhsun and Ryabinin, Max and Jia, Zhihao and Chen, Beidi},
journal={arXiv preprint arXiv:2402.12374},
year={2024}
}
```