Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mscheong01/speculative_decoding.c
minimal C implementation of speculative decoding based on llama2.c
artificial-intelligence c llama2 llm speculative-decoding
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/mscheong01/speculative_decoding.c
- Owner: mscheong01
- License: mit
- Created: 2024-04-22T04:52:52.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-07-15T03:02:56.000Z (4 months ago)
- Last Synced: 2024-09-30T22:31:31.857Z (about 2 months ago)
- Topics: artificial-intelligence, c, llama2, llm, speculative-decoding
- Language: C
- Homepage:
- Size: 2.06 MB
- Stars: 16
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: License
README
# speculative_decoding.c
minimal C implementation of speculative decoding on the llama2 model.

Speculative decoding is a technique for speeding up autoregressive inference with the help of a lightweight draft model: the draft model cheaply proposes several tokens, and the base model verifies them in a single forward pass. This project demonstrates the approach in simple, pure C code.
What I basically did was modify the `llama2.c/run.c` file to support forwarding multiple tokens, then implement `speculative_decoding.c` on top of that.
Special thanks to:
- @karpathy for providing `llama2.c` as a starting point and inspiration for this project
  - `llama2.c/run.c` was copied to this project along with its license notations.
- @ggerganov for writing `llama.cpp`, where I first got the opportunity to study and write speculative-decoding code
## How to use
1. download base/draft models
```
wget -P models https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget -P models https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
```
2. build and run
```
make && ./speculative_decoding -m ./models/stories42M.bin -d ./models/stories15M.bin -n 256 -i "Once upon a time"
```
example output:
![image](https://github.com/user-attachments/assets/c7367481-9351-4bac-b022-f416653a558a)
- orange text: accepted draft model tokens
- black text: base model tokens

### Meta llama2 models:
to use llama2 models, follow [the description written in llama2.c](https://github.com/karpathy/llama2.c?tab=readme-ov-file#metas-llama-2-models)

## References
```
@inproceedings{leviathan2023fast,
  title={Fast inference from transformers via speculative decoding},
  author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
  booktitle={International Conference on Machine Learning},
  pages={19274--19286},
  year={2023},
  organization={PMLR}
}
```
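For reference, the core accept/reject rule from the paper above, with $p$ the base model's next-token distribution and $q$ the draft model's: a draft token $x \sim q$ is kept with probability $\min(1, p(x)/q(x))$, and on rejection a replacement is drawn from the normalized residual distribution:

$$
x \sim q(\cdot), \qquad \Pr[\text{accept}] = \min\!\left(1, \frac{p(x)}{q(x)}\right), \qquad x_{\text{resample}} \sim \frac{\max(0,\ p(\cdot) - q(\cdot))}{\sum_{v} \max(0,\ p(v) - q(v))}
$$

This rule guarantees that the output tokens are distributed exactly as if they had been sampled from the base model alone.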
## Some known issues
- Generation is constrained by the draft model's maximum sequence length, so with the current setup, long generations are not possible when the draft model has a short maximum sequence length.

## License
MIT

I added the original copyright notice to the copied `run.c` file. Please let me know if I made any mistakes with the licensing.
## ETC
Any sort of feedback is very welcome :)

More speculative-decoding related C implementations are to come!
I'm thinking of https://github.com/SafeAILab/EAGLE next.