{"id":33374355,"url":"https://github.com/scale-snu/layered-prefill","last_synced_at":"2025-11-22T23:01:17.989Z","repository":{"id":320306444,"uuid":"1075260133","full_name":"scale-snu/layered-prefill","owner":"scale-snu","description":"Layered prefill changes the scheduling axis from tokens to layers and removes redundant MoE weight reloads while keeping decode stall free. The result is lower TTFT, lower end-to-end latency, and lower energy per token without hurting TBT stability.","archived":false,"fork":false,"pushed_at":"2025-10-23T04:11:59.000Z","size":3798,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-23T04:23:55.386Z","etag":null,"topics":["inference","llm","llm-infernece","llm-serving","moe","vllm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/scale-snu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-13T08:59:57.000Z","updated_at":"2025-10-23T04:12:00.000Z","dependencies_parsed_at":"2025-10-23T04:24:00.179Z","dependency_job_id":"08368946-5252-45db-91d1-5bf6154b218b","html_url":"https://github.com/scale-snu/layered-prefill","commit_stats":null,"previous_names":["scale-snu/layered-prefill"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/scale-snu/layered-prefill","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scale
-snu%2Flayered-prefill","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scale-snu%2Flayered-prefill/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scale-snu%2Flayered-prefill/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scale-snu%2Flayered-prefill/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/scale-snu","download_url":"https://codeload.github.com/scale-snu/layered-prefill/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/scale-snu%2Flayered-prefill/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285873538,"owners_count":27246054,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-22T02:00:05.934Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["inference","llm","llm-infernece","llm-serving","moe","vllm"],"created_at":"2025-11-22T23:00:47.747Z","updated_at":"2025-11-22T23:01:17.974Z","avatar_url":"https://github.com/scale-snu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Layered Prefill\n\nLayered Prefill changes the scheduling axis from tokens to layers and removes redundant MoE weight reloads while keeping decode stall free. 
The result is lower TTFT, lower end-to-end latency, and lower energy per token without hurting TBT stability.\n\n## How to install\n\n```bash\nconda create -n layered-prefill python=3.10 -y\nconda install -n layered-prefill cuda-toolkit cuda-version=12.8 cmake ninja ccache c-compiler cxx-compiler -c nvidia\nconda activate layered-prefill\npip install torch==2.8.0 uv httpie psutil\ngit clone https://github.com/vllm-project/flash-attention.git flash-attention\ncd flash-attention; git checkout d9e577e; patch -p0 \u003c ../flash-attention.patch; cd ..\nTORCH_CUDA_ARCH_LIST=\"8.0;9.0\" CCACHE_NOHASHDIR=\"true\" uv pip install -e flash-attention --verbose --refresh --no-build-isolation\n\nCCACHE_NOHASHDIR=\"true\" uv pip install -e . --no-build-isolation --verbose --refresh\n```\n\n## How to run\n\n```bash\n# chunked prefill\nCUDA_VISIBLE_DEVICES=0,1 python nanovllm/entrypoints/api_server.py --model /home/gunjunlee/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B/snapshots/ad44e777bcd18fa416d9da3bd8f70d33ebb85d39/ --max-num-batched-tokens 512 --max-num-seqs 256 --max-model-len 32768 --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --schedule-mode chunked-prefill --num-stages 1\n\n# layered prefill\nCUDA_VISIBLE_DEVICES=0,1 python nanovllm/entrypoints/api_server.py --model /home/gunjunlee/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B/snapshots/ad44e777bcd18fa416d9da3bd8f70d33ebb85d39/ --max-num-batched-tokens 8192 --max-num-seqs 256 --max-model-len 32768 --gpu-memory-utilization 0.9 --tensor-parallel-size 2 --schedule-mode layered-prefill --num-stages 16\n```\n\n## Algorithm\n\n![Layered Prefill](assets/layered_prefill.png)\n\nThe model is partitioned into contiguous layer groups and prefill advances one group per iteration while every group continues to run decode. At each iteration exactly one designated group performs decode with prefill for newly admitted requests. All other groups execute decode only. 
Prefill then moves to the next group in the following iteration and completes after as many iterations as there are groups. Decode never pauses and stall-free behavior holds throughout.\n\nThe key effect is that a prompt traverses each layer once during prefill. Chunk-based methods repeat the full stack for every chunk and reload MoE experts over and over. Layered Prefill eliminates this chunk-amplified reload. Off-chip bandwidth drops, and energy consumption drops with it. Because decode work exists in every iteration, TBT remains within the SLO envelope.\n\nThe method is orthogonal to chunking. When very long inputs must be pipelined, you can still chunk while using Layered Prefill to raise the chunk size safely. Fewer chunks mean fewer expert reloads and less bandwidth. With sufficiently large effective chunks, MoE shifts from memory bound toward compute bound, which further moderates latency growth.\n\nWe made the following key observations. First, Layered Prefill expands the TTFT and TBT Pareto frontier and sustains higher SLO attainment at higher request rates than chunked prefill. Queueing and prefill time drop while TBT quality stays strong.\n\n![Pareto frontier](assets/slo_distribution.png)\n\nSecond, throughput under SLO improves on both arXiv and ShareGPT-style traces. On arXiv the system holds near-perfect SLO attainment up to higher request rates, where chunked prefill already collapses. On ShareGPT the advantage persists at higher load as well.\n\n![SLO attainment](assets/slo_attainment.png)\n\nThird, latency quality improves. At the same request rate on arXiv with a representative MoE model, the mean TTFT falls by about half and the tail TTFT drops markedly. The token generation trajectory shows an earlier first token and a steeper slope over wall-clock time, which shortens end-to-end latency for a single request.\n\nFourth, energy per token goes down. We define energy per token as total GPU energy divided by the sum of prompt and generated tokens. 
Layered Prefill reduces this metric on both models and datasets. The reduction aligns with the measured cut in redundant expert weight traffic.\n\nFifth, MoE traffic decreases. Expert weight load bytes shrink on both traces, with larger gains for long prompts where chunking would otherwise trigger repeated reloads. The traffic reduction is consistent with the SLO gains at high request rates.\n\nFinally, raising chunk size alone cannot recover the same benefits under chunked prefill. Larger chunks reduce runtime and energy but inflate tail TBT and violate SLO at scale. Layered Prefill preserves the efficiency of large effective chunks without the TBT regressions because decode continues every iteration.\n\n## VS. vLLM (v0.10.2)\n\nLayered prefill shows significant advantages over vLLM's chunked prefill in terms of TTFT, TBT stability, and energy efficiency.\nWe compared both systems using the Qwen3-30B-A3B model on the arXiv trace with identical hardware (2x H100 80GB GPUs) and similar configurations (tensor parallelism of 2, max model length of 32K tokens, and GPU memory utilization of 0.85).\nThe results are as follows:\n\n### Overall comparison\n\n| Metric | vLLM | Layered Prefill | Δ (Layered − vLLM) |\n|---|---:|---:|---:|\n| Mean TTFT (ms) | 1018.75 | 712.84 | **−30.0%** |\n| Median TTFT (ms) | 872.98 | 560.54 | **−35.8%** |\n| Mean TPOT (ms) | 19.71 | 15.09 | **−23.4%** |\n| Median TPOT (ms) | 20.14 | 14.52 | **−27.9%** |\n| Mean ITL (ms) | 19.61 | 14.89 | **−24.1%** |\n| Median ITL (ms) | 15.42 | 12.74 | **−17.4%** |\n| Mean E2E latency (ms) | 4904.02 | 3525.08 | **−28.1%** |\n| Median E2E latency (ms) | 4424.63 | 3123.80 | **−29.4%** |\n\n### Latency percentiles\n\n#### TTFT\n\n| Percentile | vLLM (ms) | Layered Prefill (ms) |\n|---|---:|---:|\n| P5 | 231.98 | 101.10 |\n| P10 | 305.23 | 190.88 |\n| P50 | 872.98 | 560.54 |\n| P90 | 1829.95 | 1430.42 |\n| P95 | 2383.19 | 2105.67 |\n| P99 | 2812.16 | 2656.01 |\n| P99.9 | 3050.27 | 2936.77 |\n| P100 | 3076.73 | 
2967.97 |\n\n#### Inter-token latency (ITL)\n\n| Percentile | vLLM (ms) | Layered Prefill (ms) |\n|---|---:|---:|\n| P5 | 7.16 | 7.01 |\n| P10 | 8.47 | 7.66 |\n| P50 | 15.42 | 12.74 |\n| P90 | 32.56 | 25.54 |\n| P95 | 34.00 | 26.74 |\n| P99 | 36.79 | 27.90 |\n| P99.9 | 41.65 | 35.13 |\n| P100 | 45.37 | 39.80 |\n\n#### End-to-end latency\n\n| Percentile | vLLM (ms) | Layered Prefill (ms) |\n|---|---:|---:|\n| P5 | 1674.82 | 1111.56 |\n| P10 | 2287.97 | 1349.73 |\n| P50 | 4424.63 | 3123.80 |\n| P90 | 8332.01 | 6158.27 |\n| P95 | 9541.14 | 7022.79 |\n| P99 | 10446.70 | 7822.13 |\n| P99.9 | 10862.74 | 8197.62 |\n| P100 | 10908.97 | 8239.34 |\n\n### Commands to reproduce\n\n```\n# Start the API server\nLayered-prefill: TORCH_CUDA_ARCH_LIST=\"8.0;9.0\" PATH=$PATH:$CONDA_PREFIX/nvvm/bin CUDA_HOME=$CONDA_PREFIX/targets/x86_64-linux CUDA_VISIBLE_DEVICES=0,1 python nanovllm/entrypoints/api_server.py --model /home/gunjunlee/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B/snapshots/ad44e777bcd18fa416d9da3bd8f70d33ebb85d39/ --max-num-batched-tokens 8192 --max-num-seqs 256 --max-model-len 16384 --gpu-memory-utilization 0.85 --tensor-parallel-size 2 --log-level debug --host localhost --port 8000 --nccl-port 51981 --schedule-mode layered-prefill --num-stages 16\nvLLM: CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3-30B-A3B --no-enable-prefix-caching --tensor-parallel-size 2 --max-num_batched-tokens 512 --max-model-len 16384 --max-num-seqs 256 --gpu-memory-utilization 0.85\n\n# Run the benchmark\nLayered-prefill: python benchmarks/benchmark_serving.py --model Qwen/Qwen3-30B-A3B --endpoint /generate --request-rate 1.5 --percentile-metrics 'ttft,tpot,itl,e2el' --metric-percentiles '5,10,50,90,95,99,99.9,100' --goodput 'ttft:200' 'tpot:20' 'e2el:20000' --num-prompts 100 --dataset-name arxiv --port 8000 --backend nano-vllm\nvLLM: python benchmarks/benchmark_serving.py --model Qwen/Qwen3-30B-A3B --endpoint /v1/completions --request-rate 1.5 --percentile-metrics 'ttft,tpot,itl,e2el' 
--metric-percentiles '5,10,50,90,95,99,99.9,100' --goodput 'ttft:200' 'tpot:20' 'e2el:20000' --num-prompts 100 --dataset-name arxiv --port 8000 --backend vllm\n```\n\n## Citation\n\nIf you use layered prefill for your research, please cite our [paper](https://arxiv.org/abs/2510.08055):\n```bibtex\n@misc{lee2025tokenslayersredefiningstallfree,\n      title={From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill},\n      author={Gunjun Lee and Jiwon Kim and Jaiyoung Park and Younjoo Lee and Jung Ho Ahn},\n      year={2025},\n      eprint={2510.08055},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https://arxiv.org/abs/2510.08055},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscale-snu%2Flayered-prefill","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fscale-snu%2Flayered-prefill","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fscale-snu%2Flayered-prefill/lists"}