Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mscheong01/speculative_decoding.c
minimal C implementation of speculative decoding based on llama2.c
artificial-intelligence c llama2 llm speculative-decoding
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/mscheong01/speculative_decoding.c
- Owner: mscheong01
- License: mit
- Created: 2024-04-22T04:52:52.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-07-15T03:02:56.000Z (4 months ago)
- Last Synced: 2024-09-30T22:31:31.857Z (about 2 months ago)
- Topics: artificial-intelligence, c, llama2, llm, speculative-decoding
- Language: C
- Homepage:
- Size: 2.06 MB
- Stars: 16
- Watchers: 1
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: License
README
# speculative_decoding.c
minimal C implementation of speculative decoding on the llama2 model.

Speculative decoding is a technique for speeding up autoregressive inference with the help of a lightweight draft model: the draft model cheaply proposes several tokens, and the base model verifies them in a single forward pass. This project demonstrates the approach in simple, pure C code.
What I basically did was modify the `llama2.c/run.c` file to support forwarding multiple tokens, then implement `speculative_decoding.c` on top of that.
Special thanks to:
- @karpathy for providing `llama2.c` as a starting point and inspiration for this project
  - `llama2.c/run.c` was copied to this project along with its license notations.
- @ggerganov for writing `llama.cpp`, where I first got the opportunity to study and write speculative-decoding code
## How to use
1. download base/draft models
```
wget -P models https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
wget -P models https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
```
2. build and run
```
make && ./speculative_decoding -m ./models/stories42M.bin -d ./models/stories15M.bin -n 256 -i "Once upon a time"
```
example output:
![image](https://github.com/user-attachments/assets/c7367481-9351-4bac-b022-f416653a558a)
- orange text: accepted draft model tokens
- black text: base model tokens

### Meta llama2 models:
to use llama2 models, follow [the description written in llama2.c](https://github.com/karpathy/llama2.c?tab=readme-ov-file#metas-llama-2-models)

## References
```
@inproceedings{leviathan2023fast,
  title={Fast inference from transformers via speculative decoding},
  author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},
  booktitle={International Conference on Machine Learning},
  pages={19274--19286},
  year={2023},
  organization={PMLR}
}
```
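For reference, the core accept/reject rule from the paper above, with $p$ the base model's next-token distribution and $q$ the draft model's: a draft token $x \sim q$ is kept with probability $\min(1, p(x)/q(x))$, and on rejection a replacement is drawn from the normalized residual distribution:

$$
x \sim q(\cdot), \qquad \Pr[\text{accept}] = \min\!\left(1, \frac{p(x)}{q(x)}\right), \qquad x_{\text{resample}} \sim \frac{\max(0,\ p(\cdot) - q(\cdot))}{\sum_{v} \max(0,\ p(v) - q(v))}
$$

This rule guarantees that the output tokens are distributed exactly as if they had been sampled from the base model alone.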
## Some known issues
- Generation is constrained by the draft model's maximum sequence length, so with the current setup, long generations are not possible when the draft model has a short maximum sequence length.

## License
MIT

I added the original copyright notice to the copied `run.c` file. Please let me know if I made any mistakes with the licensing.
## ETC
Any sort of feedback is very welcome :)

More speculative-decoding related C implementations are to come!
I'm thinking of https://github.com/SafeAILab/EAGLE next.