{"id":14964570,"url":"https://github.com/efeslab/nanoflow","last_synced_at":"2025-05-16T18:03:31.373Z","repository":{"id":254913840,"uuid":"844386084","full_name":"efeslab/Nanoflow","owner":"efeslab","description":"A throughput-oriented high-performance serving framework for LLMs","archived":false,"fork":false,"pushed_at":"2024-09-21T05:50:54.000Z","size":7880,"stargazers_count":788,"open_issues_count":11,"forks_count":32,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-07T04:19:03.760Z","etag":null,"topics":["cuda","inference","llama2","llm","llm-serving","model-serving"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2408.12757","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/efeslab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-19T06:39:19.000Z","updated_at":"2025-04-04T07:39:23.000Z","dependencies_parsed_at":"2024-08-27T00:06:57.025Z","dependency_job_id":"08c3dc02-a383-4d11-b901-cfc2c13113f0","html_url":"https://github.com/efeslab/Nanoflow","commit_stats":{"total_commits":38,"total_committers":10,"mean_commits":3.8,"dds":0.6842105263157895,"last_synced_commit":"d6b381e58110a8b5d08cfabd4a55c0d5d0ebef57"},"previous_names":["efeslab/nanoflow"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2FNanoflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2FNanoflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/efeslab%2FNanoflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/efeslab%2FNanoflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/efeslab","download_url":"https://codeload.github.com/efeslab/Nanoflow/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247589826,"owners_count":20963025,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","inference","llama2","llm","llm-serving","model-serving"],"created_at":"2024-09-24T13:33:25.374Z","updated_at":"2025-04-07T04:19:08.817Z","avatar_url":"https://github.com/efeslab.png","language":"Cuda","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./figures/NanoflowLogo.png\" alt=\"Image description\" width=\"500\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2408.12757\"\u003ePaper\u003c/a\u003e | \u003ca href=\"https://github.com/efeslab/Nanoflow\"\u003eSlides\u003c/a\u003e\n\u003c/p\u003e\n\n\n\nNanoFlow is a throughput-oriented high-performance serving framework for LLMs.  NanoFlow consistently delivers superior throughput compared to vLLM, Deepspeed-FastGen, and TensorRT-LLM. 
**NanoFlow achieves up to a 1.91x throughput boost compared to TensorRT-LLM.** The key features of NanoFlow include:\n\n- **Intra-device parallelism**: Maximizes hardware utilization by exploiting nano-batching and execution unit scheduling to overlap different resource demands inside a single device.\n- **Asynchronous CPU scheduling**: Achieves highly efficient CPU scheduling by adopting asynchronous control flow for GPU execution, CPU batch formation, and KV-cache management.\n\n\n\n## News\n- [2024/09] 🚀 NanoFlow now supports the Llama2 70B, Llama3 70B, Llama3.1 70B, Llama3 8B, Llama3.1 8B, and Qwen2 72B models. We also released experiment scripts to reproduce the evaluation results.\n\n## Introduction\n\n\n\nThe key insight behind NanoFlow is that the traditional pipeline design of existing frameworks under-utilizes hardware resources due to the sequential execution of operations. Therefore, NanoFlow proposes intra-device parallelism (as shown in the following gif), which uses nano-batches to schedule compute-, memory-, and network-bound operations for simultaneous execution. Such overlapping leaves compute-bound operations on the critical path and boosts resource utilization.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./figures/SystemDesign.png\" alt=\"system design\" width=\"90%\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eOverview of NanoFlow's key components\u003c/em\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./figures/pipeline.gif\" alt=\"system design\" width=\"90%\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eIllustration of intra-device parallelism\u003c/em\u003e\u003c/p\u003e\n\nWith the GPU highly utilized, the CPU overhead, which consists of KV-cache management, batch formation, and retired-request selection, takes a significant part ($\u003e10$%) of inference time. Therefore, NanoFlow adopts an asynchronous control flow, as shown in the following figure. 
At any iteration $i$, NanoFlow makes batching decisions and allocates the KV-cache entries for the next iteration before the end of the current iteration. NanoFlow directly launches iteration $i + 1$ without detecting the end-of-sequence (EOS) tokens generated in iteration $i$ and retires completed requests at iteration $i+2$.\n\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./figures/async-schedule.png\" alt=\"system design\" width=\"90%\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eExplanation of asynchronous control flow scheduling\u003c/em\u003e\u003c/p\u003e\n\nTo avoid recomputation and reuse the KV-cache across multi-round conversations, NanoFlow eagerly offloads the KV-cache of finished requests to SSDs. In each iteration, NanoFlow selects the KV-cache of the retired requests and copies it to the host, layer by layer, in parallel with the ongoing inference operations. Our calculation shows that only 5GB/s of offloading bandwidth is needed when serving LLaMA2-70B, while a single SSD can reach 3GB/s. \n\nWith all of the above techniques implemented, we open-source NanoFlow, with a C++-based backend and a Python-based demo frontend, in ~4K lines of code. NanoFlow integrates state-of-the-art kernel libraries including [CUTLASS](https://github.com/NVIDIA/cutlass) for GEMM, [FlashInfer](https://github.com/flashinfer-ai/flashinfer) for attention, and [MSCCL++](https://github.com/microsoft/mscclpp) for network communication. This codebase also contains the necessary scripts for environment setup and experiment reproduction.\n\n## Benchmarks\nWe list some of the primary benchmarks. Please check our paper for more details. We evaluate on A100 80GB SXM GPUs and choose [vLLM v0.5.3](https://github.com/vllm-project/vllm/pull/6696), [Deepspeed-FastGen v0.2.3](https://github.com/microsoft/DeepSpeed-MII/pull/433), and [TensorRT-LLM v0.8.0](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0) as baselines. 
Note that all frameworks turn off optional optimizations such as quantization, speculative decoding, and prefix caching.\n### Offline throughput: Llama2-70B on 8xA100 (80GB)\nWe measure offline throughput in two settings: practical workloads from collected traces ([Splitwise](https://arxiv.org/abs/2311.18677), [LMSYS-Chat-1M](https://arxiv.org/abs/2309.11998), [ShareGPT](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered)), and constant input/output lengths. NanoFlow consistently surpasses all the baselines.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./figures/OfflineThroughput.png\" alt=\"system design\" width=\"90%\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eOffline throughput benchmarks\u003c/em\u003e\u003c/p\u003e\n\n### Online latency: Llama2-70B on 8xA100 (80GB)\nWe test the normalized latency (the end-to-end request latency divided by the number of output tokens) with the three real-world traces and vary the request rate (incoming requests per second). NanoFlow sustains a higher request rate with low latency compared to the baselines across all datasets.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./figures/online-latency.png\" alt=\"system design\" width=\"90%\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eOnline latency benchmarks\u003c/em\u003e\u003c/p\u003e\n\n### Feasibility: offline throughput on different models\nWe ported NanoFlow to five representative models to showcase its flexibility. 
We evaluate the offline throughput of NanoFlow (tokens per second per GPU) on these LLMs with a constant input length of 1024 and output length of 512.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./figures/feasibility.png\" alt=\"system design\" width=\"90%\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eOffline throughput of NanoFlow on different models\u003c/em\u003e\u003c/p\u003e\n\n# Codebase\n## Abstract\n\nThe increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems’ performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance.\n\nWe propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of resources including compute, memory, and network within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations: first, NanoFlow proposes nano-batching to split requests at the granularity of operations, which breaks the dependency of sequential operations in LLM inference and enables overlapping them; second, to benefit from overlapping, NanoFlow uses a device-level pipeline with execution unit scheduling, which partitions the device’s functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which makes it easy to port NanoFlow to different models. 
We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models including LLaMA-2-70B, Mixtral 8×7B, and LLaMA-3-8B. We show that NanoFlow achieves 68.5% of optimal throughput. With practical workloads, NanoFlow provides a 1.91× throughput boost compared to state-of-the-art serving systems, achieving 59% to 72% of optimal throughput across ported models.\n\n## Installation\n### Docker setup\n```bash\nmkdir -p ~/framework-test\ndocker run --gpus all --net=host --privileged -v /dev/shm:/dev/shm --name nanoflow -v ~/framework-test:/code -it nvcr.io/nvidia/nvhpc:23.11-devel-cuda_multi-ubuntu22.04\n```\n\n\u003e If using RunPod, we recommend using the PyTorch 2.2.0 template.\n\n### Install dependencies\n```bash\ngit clone https://github.com/efeslab/Nanoflow.git\ncd Nanoflow\nchmod +x ./installAnaconda.sh\n./installAnaconda.sh\n# restart the terminal\n```\n\n```bash\nyes | ./setup.sh\n```\n\n### Serve different models\n```bash\n./serve.sh\n```\n![Nanoflow](./figures/serve.png)\n\n![Nanoflow](./figures/SampleOutput.png)\n\n\n## Evaluation\n\n```bash\n./perf.sh\n```\nResult figures can be found in `Nanoflow/pipeline/eval`.\n\n\n![Nanoflow](./figures/OfflineThroughput.png)\n\n## Citation\n\nIf you use NanoFlow for your research, please cite our [paper](https://arxiv.org/abs/2408.12757):\n```bibtex\n@misc{zhu2024nanoflowoptimallargelanguage,\n      title={NanoFlow: Towards Optimal Large Language Model Serving Throughput}, \n      author={Kan Zhu and Yilong Zhao and Liangyu Zhao and Gefei Zuo and Yile Gu and Dedong Xie and Yufei Gao and Qinyu Xu and Tian Tang and Zihao Ye and Keisuke Kamahori and Chien-Yu Lin and Stephanie Wang and Arvind Krishnamurthy and Baris Kasikci},\n      year={2024},\n      eprint={2408.12757},\n      archivePrefix={arXiv},\n      primaryClass={cs.DC},\n      url={https://arxiv.org/abs/2408.12757}, \n}\n```\n\n## Acknowledgement\nNanoFlow is inspired by and reuses code from the following projects: 
[CUTLASS](https://github.com/NVIDIA/cutlass), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [MSCCL++](https://github.com/microsoft/mscclpp), and [Punica](https://github.com/punica-ai/punica). Development of NanoFlow is made easier thanks to these tools: [GoogleTest](https://github.com/google/googletest), [NVBench](https://github.com/NVIDIA/nvbench), and [spdlog](https://github.com/gabime/spdlog). We thank Siqin Chen for her help in the design of the NanoFlow logo.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fefeslab%2Fnanoflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fefeslab%2Fnanoflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fefeslab%2Fnanoflow/lists"}