https://github.com/wtlow003/speculative-sampling
Implementation of Speculative Sampling in "Accelerating Large Language Model Decoding with Speculative Sampling"
- Host: GitHub
- URL: https://github.com/wtlow003/speculative-sampling
- Owner: wtlow003
- License: mit
- Created: 2024-08-14T09:08:19.000Z (5 months ago)
- Default Branch: master
- Last Pushed: 2024-08-20T16:04:55.000Z (5 months ago)
- Last Synced: 2024-11-15T14:17:45.882Z (2 months ago)
- Topics: deepmind, llm-inference, speculative-decoding, speculative-sampling
- Language: Python
- Homepage: https://www.jensenlwt.com/blog/understanding-speculative-decoding-for-llm-inference
- Size: 30.3 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Speculative Sampling for Faster LLM Inference
## About
This repository contains the implementation of the speculative sampling method for faster LLM inference with a draft model.
The implementation is based on my own interpretation of the paper [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318) by DeepMind.
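
To make the core idea concrete, here is a minimal sketch of the accept/reject step described in the paper, written in plain Python. It is an illustration under stated assumptions, not the code in this repository: the function name `speculative_step` and its list-based inputs are hypothetical, and a real implementation would operate on model logits in batches.

```python
import random


def speculative_step(draft_probs, target_probs, draft_token, rng=random):
    """Accept or reject one drafted token (modified rejection sampling).

    draft_probs / target_probs: probability lists over the vocabulary at this
    position from the draft and target models (hypothetical inputs).
    draft_token: index of the token that was sampled from draft_probs.
    Returns (token, accepted).
    """
    p = draft_probs[draft_token]   # draft probability of the drafted token
    q = target_probs[draft_token]  # target probability of the same token
    # Accept the drafted token with probability min(1, q / p).
    if rng.random() < min(1.0, q / p):
        return draft_token, True
    # On rejection, resample from the residual max(0, q - p), renormalized.
    residual = [max(0.0, qi - pi) for qi, pi in zip(target_probs, draft_probs)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False
```

When the two distributions are identical, the ratio `q / p` is always 1, so every drafted token is accepted. This is the property the sanity check in the Results section relies on.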
## Installation
### Setting Up the Environment
This project uses uv for dependency management. To install uv, run one of the following commands:
```bash
# On macOS and Linux.
curl -LsSf https://astral.sh/uv/install.sh | sh
# On Windows.
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
# With pip.
pip install uv
# With pipx.
pipx install uv
# With Homebrew.
brew install uv
# With Pacman.
pacman -S uv
```

Thereafter, install the rest of the dependencies using uv:
```bash
# create a virtual env
uv venv
# install dependencies
uv pip install -r requirements.txt  # Install from a requirements.txt file.
```

## Usage
```bash
# check cli options
python main.py --help

usage: main.py [-h] --target-model TARGET_MODEL --draft-model DRAFT_MODEL --input-str INPUT_STR [--num-runs NUM_RUNS] [--N N] [--K K] [--temperature TEMPERATURE]
               [--top-k TOP_K] [--top-p TOP_P]

optional arguments:
  -h, --help            show this help message and exit
  --target-model TARGET_MODEL
                        Target model
  --draft-model DRAFT_MODEL
                        Draft model
  --input-str INPUT_STR
                        Input string
  --num-runs NUM_RUNS   Number of LLM inference runs
  --N N                 Number of tokens to generate
  --K K                 Number of tokens to speculate
  --temperature TEMPERATURE
                        Temperature
  --top-k TOP_K         Top k sampling
  --top-p TOP_P         Top p sampling
```

Running LLM inference comparison script:
```bash
python main.py --target-model gpt2-xl \
--draft-model gpt2 \
--input-str "Alan Turing theorized that computers would one day become" \
--num-runs 50 \
--N 40 \
--K 4 \
--temperature 0.6 \
--top-k 25 \
--top-p 0.9
```

- The script runs LLM inference `num-runs + 1` times to account for warmup; with `--num-runs 1`, for example, it performs two runs.
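
The warmup behaviour noted above can be sketched as a small timing helper. This is a hypothetical illustration, not the repository's benchmarking code: the name `benchmark` is made up, and it assumes the extra warmup run is excluded from the reported mean and standard deviation, which the README does not state explicitly.

```python
import statistics
import time


def benchmark(fn, num_runs):
    """Time fn() num_runs times after one untimed warmup call."""
    fn()  # warmup run; assumed to be excluded from the statistics
    timings = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    mean = statistics.mean(timings)
    std = statistics.pstdev(timings)  # population std over the timed runs
    return mean, std
```

With this shape, `benchmark(fn, 1)` calls `fn` twice in total: once for warmup and once for the timed run.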
## Results
### MPS
The following results are obtained on a MacBook Pro M2 Pro Max with 32GB RAM comparing speculative sampling with naive autoregressive sampling in LLM inference over multiple iterations:
- `N`: 40
- `K`: 4
- `temperature`: 0.6
- `top-k`: 25
- `top-p`: 0.9

1. **Target Model**: [`gpt2-xl`](https://huggingface.co/openai-community/gpt2-xl) and **Draft Model**: [`gpt2-xl`](https://huggingface.co/openai-community/gpt2-xl)

> [!NOTE]
>
> This serves as a sanity check for the speculative sampling method.
>
> In this case, since the target model and draft model are the same, there should be no rejection of the speculative samples.

| Method                  | num_runs | time       | +/- std  | speedup  |
| ----------------------- | -------- | ---------- | -------- | -------- |
| Autoregressive Sampling | 50 | 2.84 | 0.20 | 1.00 |
| Speculative Sampling    | 50       | 2.96       | 0.22     | 0.96     |

2. **Target Model**: [`gpt2-xl`](https://huggingface.co/openai-community/gpt2-xl) and **Draft Model**: [`gpt2`](https://huggingface.co/openai-community/gpt2)

| Method | num_runs | time | +/- std | speedup |
| ----------------------- | -------- | ---------- | -------- | -------- |
| Autoregressive Sampling | 50 | 2.86 | 0.16 | 1.00 |
| Speculative Sampling    | 50       | 2.17       | 0.32     | 1.31     |

### CUDA
The following results are obtained when running on 2x A6000 comparing speculative sampling with naive autoregressive sampling in LLM inference in a single iteration:
- `N`: 50
- `K`: 4
- `temperature`: 0
- `top-k`: 0
- `top-p`: 0

1. **Target Model**: [`Meta-Llama-3.1-70B-bnb-4bit`](https://huggingface.co/unsloth/Meta-Llama-3.1-70B-bnb-4bit) and **Draft Model**: [`Meta-Llama-3.1-8B-bnb-4bit`](https://huggingface.co/unsloth/Meta-Llama-3.1-8B-bnb-4bit)

https://github.com/user-attachments/assets/d66a4f56-2d61-4beb-8aad-8e0d5c91193b
| Method | time | token/sec | speedup |
| ----------------------- | ---------- | ---------- | -------- |
| Autoregressive Sampling | 27.7 | 1.79 | 1.00 |
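| Speculative Sampling    | 11.3       | 4.68       | ~2.61x   |

Based on the mini experiment and results above, speculative sampling offers a significant speedup over autoregressive sampling when the draft model is smaller than the target: about 1.31x for `gpt2-xl` with a `gpt2` draft on MPS, and about 2.61x for the Llama 3.1 70B target with the 8B draft on CUDA.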
In our sanity check, we confirmed that when the target and draft models are identical, the speculative sampling method does not produce any rejected samples, since the tokens are sampled from the exact same probability distribution. Additionally, because the models are identical in size and we're essentially running the same model twice (more forward passes in the draft model), the speculative sampling method is expected to be slower than the autoregressive sampling method.
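
The slowdown in the identical-model case can be illustrated with a rough cost model. The sketch below is a simplification, not a measurement: the function `ideal_speedup` and its `draft_cost_ratio` parameter are hypothetical, and it assumes that every drafted token is accepted and that the target's single pass over all drafted tokens costs about the same as one ordinary target forward pass.

```python
def ideal_speedup(k, draft_cost_ratio):
    """Upper-bound speedup when every drafted token is accepted.

    k: number of speculated tokens per iteration (the --K option).
    draft_cost_ratio: cost of one draft forward pass relative to one target
        forward pass (hypothetical parameter; 1.0 means identical models).
    """
    # One iteration: k draft passes plus one target pass that scores all drafts.
    iteration_cost = k * draft_cost_ratio + 1.0
    # With full acceptance, the iteration yields the k drafted tokens plus one
    # extra token sampled from the target's final distribution.
    tokens_per_iteration = k + 1
    # Autoregressive decoding pays one target pass per generated token.
    return tokens_per_iteration / iteration_cost


print(ideal_speedup(4, 1.0))  # 1.0
```

Under these assumptions, identical models with `K = 4` break even at best (a ratio of 1.0), so any extra overhead from the accept/reject bookkeeping pushes the measured speedup below 1, which is consistent with the 0.96 reported in the sanity check. A cheaper draft model lowers `draft_cost_ratio` and raises the bound.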
## References
```
@misc{chen2023acceleratinglargelanguagemodel,
title={Accelerating Large Language Model Decoding with Speculative Sampling},
author={Charlie Chen and Sebastian Borgeaud and Geoffrey Irving and Jean-Baptiste Lespiau and Laurent Sifre and John Jumper},
year={2023},
eprint={2302.01318},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2302.01318},
}
```

## Acknowledgements
The implementation of speculative sampling builds upon the following repositories:
1. https://github.com/feifeibear/LLMSpeculativeSampling
2. https://github.com/jaymody/speculative-sampling
3. https://gist.github.com/bsantraigi/5752667525d88d375207f099bd78818b