{"id":33374382,"url":"https://github.com/zejia-lin/bulletserve","last_synced_at":"2025-11-22T23:01:21.625Z","repository":{"id":312892063,"uuid":"1043577189","full_name":"zejia-lin/BulletServe","owner":"zejia-lin","description":"Boosting GPU utilization for LLM serving via dynamic spatial-temporal prefill \u0026 decode orchestration","archived":false,"fork":false,"pushed_at":"2025-09-24T05:39:53.000Z","size":5121,"stargazers_count":11,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-24T07:25:00.740Z","etag":null,"topics":["gpu-sharing","inference","llm","llm-serving","sglang"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zejia-lin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":"docs/supported_models/embedding_models.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-24T06:49:18.000Z","updated_at":"2025-09-24T05:39:59.000Z","dependencies_parsed_at":"2025-09-02T17:52:23.404Z","dependency_job_id":null,"html_url":"https://github.com/zejia-lin/BulletServe","commit_stats":null,"previous_names":["zejia-lin/bullet","zejia-lin/bulletserve"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/zejia-lin/BulletServe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zejia-lin%2FBulletServe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zejia-lin%2FBulletServe/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zejia-lin%2FBulletServe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zejia-lin%2FBulletServe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zejia-lin","download_url":"https://codeload.github.com/zejia-lin/BulletServe/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zejia-lin%2FBulletServe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285873538,"owners_count":27246054,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-22T02:00:05.934Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gpu-sharing","inference","llm","llm-serving","sglang"],"created_at":"2025-11-22T23:01:08.988Z","updated_at":"2025-11-22T23:01:21.605Z","avatar_url":"https://github.com/zejia-lin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BulletServe\n\n\u003c!-- \u003ch3 align=\"center\"\u003eBullet: Boosting GPU Utilization for LLM Serving via \u003cbr\u003eDynamic Spatial-Temporal Orchestration\u003c/h3\u003e --\u003e\n\n\n**\u003cu\u003e[[Paper]](https://arxiv.org/abs/2504.19516)\u003c/u\u003e Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing**\n\n\nBulletServe is a novel LLM serving system that 
enables concurrent execution of prefill and decode phases on the same device through **fine-grained spatial-temporal GPU sharing**.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/compare.png\" alt=\"compare\" width=\"500\"\u003e\n\u003c/p\u003e\n\n## Overview\n\nThe key insight behind Bullet is the complementary resource requirements of the compute-intensive prefill and memory-bound decode phases.\nBullet exploits **intra-device disaggregation** for prefill and decode phases. \nThis eliminates the inefficiencies in chunked prefill and consistently delivers higher throughput and goodput.\nDesigned with **dynamic computational resource provisioning**, Bullet addresses the fundamental throughput-latency tradeoff in LLM serving with higher GPU utilization.\n\n\u003c!-- The prefill/decode engines operate autonomously in separate processes, while sharing GPU memory and computational units.  --\u003e\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/bullet_engine.png\" alt=\"bullet_engine\" width=\"500\"\u003e\n\u003c/p\u003e\n\n\n## Installation\n\n### Dependencies\n\n- CUDA \u003c= 12.6, required by [libsmctrl](http://rtsrv.cs.unc.edu/cgit/cgit.cgi/libsmctrl.git/about/).\n- Python \u003e= 3.12.9, **strongly recommended**. There may be weird bugs with lower versions.\n- [Conda](https://www.anaconda.com/) or [uv](https://docs.astral.sh/uv/getting-started/installation/).\n\n### Compile Libsmctrl\n\nBullet leverages [libsmctrl](http://rtsrv.cs.unc.edu/cgit/cgit.cgi/libsmctrl.git/about/), a streaming multiprocessor (SM) masking library, to enable fine-grained computational unit partitioning. 
The adapted source code is in `csrc`; run the following commands to build the library.\n\n```bash\ngit clone https://github.com/zejia-lin/Bullet.git\ncd Bullet/csrc\nmake config\nmake build\n```\n\n### Install Bullet\n\nInstall Bullet using `conda` or `uv`.\n\n```bash\ncd Bullet\n\n# For conda\nconda create -n bullet python==3.12.9\nconda activate bullet\npip install -e \"python[all]\"\n\n# For uv\nuv venv\nuv pip install -e \"python[all]\"\nsource .venv/bin/activate\n```\n\n\n## Quick Start\n\n### Start MPS\n\nBullet depends on NVIDIA MPS for GPU spatial sharing between prefill and decode instances. Start MPS with:\n\n```bash\nbash ./scripts/start_mps.sh\n```\n\nTo stop MPS, use:\n```bash\nbash ./scripts/kill_mps.sh\n```\n\n### Launch Server\n\nBullet can be enabled by using the `--enable-bullet-engine` flag.\n\n```bash\npython -m sglang.launch_server --model-path /path/to/model --disable-radix-cache --enable-bullet-engine\n```\n\n\n## Evaluation\n\n\u003c!-- ### Llama3.1-8B on A100\n\nWe conduct experiments using [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered) (upper), [Splitwise](https://arxiv.org/abs/2311.18677) (middle) and [Alpaca](https://arxiv.org/abs/1804.05685) (bottom) datasets. 
Bullet achieves higher throughput and SLO attainment rate.\n\n![llama8b](assets/llama8b.png) --\u003e\n\n### Benchmark\n\nRun SGLang's built-in benchmark script:\n\n```bash\npython ./python/sglang/bench_serving.py \\\n        --backend sglang \\\n        --dataset-name sharegpt \\\n        --num-prompts 1000 \\\n        --host 127.0.0.1 \\\n        --port 30000 \\\n        --model /path/to/model \\\n        --dataset-path /path/to/sharegpt/dataset \\\n        --request-rate 10\n```\n\n### Llama3.1-70B and Qwen3-235B-A22B\n\nWe conduct experiments using the [Splitwise](https://arxiv.org/abs/2311.18677) dataset on A800, H100 and H20 with various models.\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd\u003e\u003cimg src=\"assets/llama70b.png\" alt=\"Llama 70B\" width=\"100%\"\u003e\u003c/td\u003e\n\u003ctd\u003e\u003cimg src=\"assets/hopper.png\" alt=\"Hopper\" width=\"100%\"\u003e\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003eLlama3.1-70B on 8xA100\u003c/td\u003e\n\u003ctd align=\"center\"\u003eDense/MoE on H100/H20\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n## Citation\n\nIf you use Bullet, please consider citing our [paper](https://arxiv.org/abs/2504.19516):\n\n```\n@misc{lin2025bulletboostinggpuutilization,\n      title={Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration}, \n      author={Zejia Lin and Hongxin Xu and Guanyi Chen and Zhiguang Chen and Yutong Lu and Xianwei Zhang},\n      year={2025},\n      eprint={2504.19516},\n      archivePrefix={arXiv},\n      primaryClass={cs.DC},\n      url={https://arxiv.org/abs/2504.19516}, \n}\n```\n\n## Acknowledgement\n\nThis repository originally started as a fork of [SGLang](https://github.com/sgl-project/sglang/). Bullet is a research prototype and does not have complete feature parity with open-source SGLang. 
We have only retained the most critical features and adapted the codebase for faster research iterations.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzejia-lin%2Fbulletserve","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzejia-lin%2Fbulletserve","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzejia-lin%2Fbulletserve/lists"}