{"id":26381406,"url":"https://github.com/thu-pacman/chitu","last_synced_at":"2025-03-17T06:01:54.495Z","repository":{"id":282344941,"uuid":"935868283","full_name":"thu-pacman/chitu","owner":"thu-pacman","description":"High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.","archived":false,"fork":false,"pushed_at":"2025-03-14T04:26:19.000Z","size":4186,"stargazers_count":73,"open_issues_count":3,"forks_count":5,"subscribers_count":20,"default_branch":"public-main","last_synced_at":"2025-03-14T04:29:19.490Z","etag":null,"topics":["deepseek","gpu","llm","llm-serving","model-serving","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thu-pacman.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-20T06:34:38.000Z","updated_at":"2025-03-14T04:28:18.000Z","dependencies_parsed_at":"2025-03-14T04:39:24.299Z","dependency_job_id":null,"html_url":"https://github.com/thu-pacman/chitu","commit_stats":null,"previous_names":["thu-pacman/chitu"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-pacman%2Fchitu","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-pacman%2Fchitu/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-pacman%2Fchitu/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thu-pacman%2Fchitu/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thu-pacman","download_url":"https://codeload.github.com/thu-pacman/chitu/tar.gz/refs/heads/public-main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243982142,"owners_count":20378604,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deepseek","gpu","llm","llm-serving","model-serving","pytorch"],"created_at":"2025-03-17T06:01:09.713Z","updated_at":"2025-03-17T06:01:54.460Z","avatar_url":"https://github.com/thu-pacman.png","language":"Python","funding_links":[],"categories":["*Ops for AI","推理 Inference","Summary","Frameworks","A01_文本生成_文本对话","Python"],"sub_categories":["Model Serving \u0026 Inference","大语言对话模型及数据"],"readme":"# Chitu\n\nEnglish | [中文](docs/zh/README_zh.md)\n\nChitu is a high-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.\n\n## News\n\n[2025/03/14] Initial release of Chitu, with support for DeepSeek-R1 671B.\n\n## Introduction\n\nChitu is a high-performance inference framework for large language models. Chitu supports various mainstream large language models, including DeepSeek, LLaMA series, Mixtral, and more. We focus on the following goals:\n\n- **Efficiency**: We continue to develop and integrate the latest optimizations for large language models, including GPU kernels, parallel strategies, quantization, and more.\n- **Flexibility**: We focus not only on the popular NVIDIA GPUs but also on all kinds of hardware environments, including legacy GPUs, non-NVIDIA GPUs, and CPUs. 
We aim to provide a versatile framework that meets diverse deployment requirements.\n- **Availability**: Chitu is ready for, and already deployed in, real-world production.\n\n\nYou are welcome to join the [WeChat group](docs/assets/wechat_group.jpg) and stay tuned!\n\n\n## Performance Evaluation\n\nWe benchmark Chitu on NVIDIA A800 40GB and H20 96GB GPUs and compare it with vLLM.\n\n### Deploy DeepSeek-R1-671B on A800(40GB) cluster\n\n#### Comparison between Chitu and vLLM with multiple nodes\n\n|Hardware environment|6 nodes|6 nodes|3 nodes|\n|:---|:---|:---|:---|\n|Framework+precision|vLLM 0.7.3, BF16|Chitu 0.1.0, BF16|Chitu 0.1.0, FP8|\n|Use cuda graph|*OOM*|29.8 output token/s|22.7 output token/s|\n|Do not use cuda graph|6.85 output token/s|8.5 output token/s|7.0 output token/s|\n\n- All figures in the table are the output throughput of a single request (bs=1)\n- With Chitu, the output speed of the FP8 model running on 3 nodes (22.7 token/s) is comparable to that of the BF16 model running on 6 nodes (29.8 token/s)\n- Using cuda graph has a significant impact on performance: with it enabled, Chitu's BF16 throughput on 6 nodes rises from 8.5 to 29.8 token/s\n- During our evaluation, we encountered an out-of-memory (OOM) error when trying to run vLLM with cuda graph under a 6-node configuration. 
We are still investigating this issue\n\n\u003cvideo src=\"https://github.com/user-attachments/assets/41495ac8-123d-4402-a6a8-0e0294b2edf4\" autoplay loop muted controls\u003e\n\u003c/video\u003e\n*This video was recorded earlier, so its performance data differs slightly from that of the released version*\n\n#### Comparison of BF16 and FP8 models running with Chitu\n\n|Batch size|6 nodes, BF16 |3 nodes, FP8|\n|:---|:---|:---|\n|1| 29.8 token/s| 22.7 token/s|\n|4| 78.8 token/s| 70.1 token/s|\n|8| 129.8 token/s| 108.9 token/s|\n|16| 181.4 token/s| 159.0 token/s|\n|32| 244.1 token/s| 214.5 token/s|\n\n- Across the tested batch sizes, the FP8 model running on 3 nodes with Chitu reaches about 75%\\~90% of the output speed of the BF16 model running on 6 nodes. Since it uses half as many GPUs, the output per unit of computing power improves by 1.5x\\~1.8x\n- We believe this is because decoding is mainly bound by memory bandwidth. Using half as many GPUs to read half as much data (FP8 weights are half the size of BF16 weights) does not take longer, and the reduction in GPU computing power has only a small impact\n\n### Deploy DeepSeek-R1-671B on the H20 (96G) cluster\n\n#### Running on 2 nodes each with 8*H20 \n\n|Framework+precision|vLLM 0.7.2, FP8|Chitu 0.1.0, FP8|\n|:---|:---|:---|\n|bs=1, output token/s|21.16|22.1|\n|bs=16, output token/s|205.09|202.1|\n|bs=256, output token/s|1148.67|780.3|\n\n- With a single request (bs=1), Chitu performs slightly better than vLLM\n- At medium batch size (bs=16), both systems show comparable performance\n- At large batch size (bs=256), vLLM achieves higher throughput; we will optimize for large batch sizes in subsequent versions of Chitu\n\n\n## Getting started\n\nYou can install Chitu from source.\n\n### Install from Source\n\n```bash\ngit clone --recursive https://github.com/thu-pacman/chitu \u0026\u0026 cd chitu\n\npip install -r requirements-build.txt\npip install -U torch --index-url 
https://download.pytorch.org/whl/cu124  # Change according to your CUDA version\nTORCH_CUDA_ARCH_LIST=8.6 CHITU_SETUP_JOBS=4 MAX_JOBS=4 pip install --no-build-isolation .\n```\n\n\n## Quick Start\n\n### Single GPU Inference\n\n```bash\ntorchrun --nproc_per_node 8 test/single_req_test.py request.max_new_tokens=64 models=DeepSeek-R1 models.ckpt_dir=/data/DeepSeek-R1 infer.pp_size=1 infer.tp_size=8\n```\n\n### Hybrid Parallelism (TP+PP)\n\n```bash\ntorchrun --nnodes 2 --nproc_per_node 8 test/single_req_test.py request.max_new_tokens=64 infer.pp_size=2 infer.tp_size=8 models=DeepSeek-R1 models.ckpt_dir=/data/DeepSeek-R1\n```\n\n### Start a Service\n\n```bash\n# Start service at localhost:21002\nexport WORLD_SIZE=8\ntorchrun --nnodes 1 \\\n    --nproc_per_node 8 \\\n    --master_port=22525 \\\n    chitu/serve.py \\\n    serve.port=21002 \\\n    infer.stop_with_eos=False \\\n    infer.cache_type=paged \\\n    infer.pp_size=1 \\\n    infer.tp_size=8 \\\n    models=DeepSeek-R1 \\\n    models.ckpt_dir=/data/DeepSeek-R1 \\\n    infer.attn_type=flash_infer \\\n    keep_dtype_in_checkpoint=True \\\n    infer.mla_absorb=absorb-without-precomp \\\n    infer.soft_fp8=True \\\n    infer.do_load=True \\\n    infer.max_reqs=1 \\\n    scheduler.prefill_first.num_tasks=100 \\\n    infer.max_seq_len=4096 \\\n    request.max_new_tokens=100 \\\n    infer.use_cuda_graph=True\n\n# Test the service\ncurl localhost:21002/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"messages\": [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": \"What is machine learning?\"\n      }\n    ]\n  }'\n```\n\n### Benchmarking\n\n```bash\n# Comprehensive performance testing with benchmark_serving tool\npython benchmarks/benchmark_serving.py \\\n    --model \"deepseek-r1\" \\\n    --iterations 10 \\\n    --seq-len 10 \\\n    --warmup 3 \\\n    --base-url 
http://localhost:21002\n```\n\n### Full Documentation\n\nSee the [development documentation](docs/Development.md) for more details.\n\n## FAQ (Frequently Asked Questions)\n\n[English](docs/en/FAQ.md) | [中文](docs/zh/FAQ.md)\n\n## Contributing\n\nWe welcome contributions! Please see our [Contributing Guide](docs/CONTRIBUTING.md) for details.\n\n## License\n\nThe Chitu Project is licensed under the Apache License v2.0; see the [LICENSE](LICENSE) file for details.\n\nThis repository also contains third_party submodules under other open source\nlicenses. You can find these submodules under the third_party/ directory, where\neach contains its own license file.\n\n\n## Acknowledgment\n\nWe learned a lot from the following projects and adapted some functions when building Chitu:\n- [vLLM](https://github.com/vllm-project/vllm)\n- [SGLang](https://github.com/sgl-project/sglang)\n- [DeepSeek](https://github.com/deepseek-ai)\n\nSpecial thanks to our partners (listed in no particular order): 中国电信, 华为, 沐曦, 燧原, etc.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-pacman%2Fchitu","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthu-pacman%2Fchitu","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthu-pacman%2Fchitu/lists"}