{"id":21389453,"url":"https://github.com/feifeibear/llmspeculativesampling","last_synced_at":"2025-04-13T07:47:05.787Z","repository":{"id":188595972,"uuid":"679045986","full_name":"feifeibear/LLMSpeculativeSampling","owner":"feifeibear","description":"Fast inference from large lauguage models via speculative decoding","archived":false,"fork":false,"pushed_at":"2024-08-22T03:34:29.000Z","size":865,"stargazers_count":707,"open_issues_count":7,"forks_count":68,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-13T07:46:49.302Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/feifeibear.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-16T01:37:56.000Z","updated_at":"2025-04-11T06:48:19.000Z","dependencies_parsed_at":"2024-08-22T04:53:41.592Z","dependency_job_id":null,"html_url":"https://github.com/feifeibear/LLMSpeculativeSampling","commit_stats":null,"previous_names":["feifeibear/llmspeculativesampling"],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FLLMSpeculativeSampling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FLLMSpeculativeSampling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FLLMSpeculativeSampling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feifeibear%2FLLMSpeculativeSampling/
manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/feifeibear","download_url":"https://codeload.github.com/feifeibear/LLMSpeculativeSampling/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248681494,"owners_count":21144700,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-22T12:26:36.638Z","updated_at":"2025-04-13T07:47:05.765Z","avatar_url":"https://github.com/feifeibear.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fast inference from transformers via speculative decoding\n\nThis repository implements speculative sampling for large language model (LLM) decoding. It utilizes two models during the decoding process: a target model and an approximation model. The approximation model is a smaller model, while the target model is a larger one. The approximation model generates token guesses, and the target model corrects these guesses. This approach allows for decoding by running the target model in parallel on the outputs of the approximation models, resulting in improved efficiency compared to decoding with the target model alone.\n\nThe speculative sampling is proposed by Google and Deepmind independently. So I implement two slightly different versions of speculative sampling: [Google's](https://arxiv.org/abs/2211.17192) and [Deepmind's](https://arxiv.org/abs/2302.01318).\n\n## Update Logs\n\n- 2023.09.21: Add serving features. Support more models, i.e. 
llama-7B and llama-1B.\n\n- 2023.09.19: Add KV cache optimization to Google's version.\n\n- 2023.08.16: First release; implements the paper's algorithm. Support Bloom-560M and Bloomz-7B1.\n\n## Usage\n### Inference\nYou need to prepare a pair of models that share the same embedding and vocabulary. The approximation model should be smaller than the target model. Here are some tested model pairs.\n\nIn the example below, we use [bloomz-7b1](https://huggingface.co/bigscience/bloomz-7b1/tree/main) as the target model and [bloom-560m](https://huggingface.co/bigscience/bloom-560m/tree/main) as the approximation model.\n\n```bash\npython main.py \\\n    --input \"The quick brown fox jumps over the lazy \" \\\n    --target_model_name bigscience/bloomz-7b1 \\\n    --approx_model_name bigscience/bloom-560m\n```\n\nYou can also pass the `-v` flag to see which model generated each token.\n\n![example image](./imgs/sps.jpg \"console output\")\n\nI recommend using llama2-7B and llama2-70B as the approximation and target models, respectively. I did observe a speedup in this case, as shown in the table below.\nNote that the choice of approximation model and target model is essential for the speedup. No speedup will be observed in the following cases:\n\n- If both models are small, the difference in their speeds is not significant.\n- If the model size difference is too large, more rejection and resampling will occur.\n\nAlso, the sampling logic is not efficient enough. I noticed substantial overhead in Softmax and LayerNorm. 
I will try to optimize it in the future.\nDo not hesitate to share ideas for performance improvements.\n\n|    | llama2-7b | llama2-70b | Speculative |\n|--------------|:--------------:|:--------------:|:--------------:|\n| speed (tokens/sec) | 1084.86 | 329.83 | 427.02 |\n\n### Serving\nStart an inference server.\n```bash\npython serving.py\n```\n\nTest the server with curl:\n```bash\ncurl -X POST -H \"Content-Type: application/json\" -d '{\"prompt\": \"Who is the president of the USA\"}' http://127.0.0.1:5000/predict\n```\n## References\n```\n@inproceedings{leviathan2023fast,\n  title={Fast inference from transformers via speculative decoding},\n  author={Leviathan, Yaniv and Kalman, Matan and Matias, Yossi},\n  booktitle={International Conference on Machine Learning},\n  pages={19274--19286},\n  year={2023},\n  organization={PMLR}\n}\n\n@article{chen2023accelerating,\n  title={Accelerating large language model decoding with speculative sampling},\n  author={Chen, Charlie and Borgeaud, Sebastian and Irving, Geoffrey and Lespiau, Jean-Baptiste and Sifre, Laurent and Jumper, John},\n  journal={arXiv preprint arXiv:2302.01318},\n  year={2023}\n}\n```\n\n## Limitations\nCurrently, only requests with a batch size of 1 are supported.\nSince this repo is built for demonstration purposes, other optimizations that are essential for efficiency, such as batching and parallelism, are not included.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeifeibear%2Fllmspeculativesampling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffeifeibear%2Fllmspeculativesampling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeifeibear%2Fllmspeculativesampling/lists"}