{"id":15054320,"url":"https://github.com/netease-media/grps_trtllm","last_synced_at":"2025-04-06T18:13:25.987Z","repository":{"id":254112814,"uuid":"845523645","full_name":"NetEase-Media/grps_trtllm","owner":"NetEase-Media","description":"Higher performance OpenAI LLM service than vLLM serve: A pure C++ high-performance OpenAI LLM service implemented with GPRS+TensorRT-LLM+Tokenizers.cpp, supporting chat and function call, AI agents, distributed multi-GPU inference, multimodal capabilities, and a Gradio chat interface.","archived":false,"fork":false,"pushed_at":"2025-03-25T11:57:53.000Z","size":132659,"stargazers_count":125,"open_issues_count":8,"forks_count":7,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-30T17:08:14.398Z","etag":null,"topics":["ai-agent","chatglm","deepseek-r1","function-call","internvideo","internvl2","janus-pro","llama-index","llama3","llm","minicpm-v","multi-modal","olmocr","openai","phi","qwen2","qwen2-vl","qwq","tensorrt-llm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NetEase-Media.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-21T12:18:03.000Z","updated_at":"2025-03-28T03:36:34.000Z","dependencies_parsed_at":"2024-09-29T03:01:09.369Z","dependency_job_id":"3a539c8f-f3d6-453f-86f8-7fde8645e8e4","html_url":"https://github.com/NetEase-Media/grps_trtllm","commit_stats":{"total_commits":73,"total_committers":2,"mean_commits":36.5,"dds":0.2191780821917808,"last_synced_commit":"e001d0cd2913d6ffd7be7fd8bf632b
3cf2dcd0ea"},"previous_names":["netease-media/grps_trtllm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NetEase-Media%2Fgrps_trtllm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NetEase-Media%2Fgrps_trtllm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NetEase-Media%2Fgrps_trtllm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NetEase-Media%2Fgrps_trtllm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NetEase-Media","download_url":"https://codeload.github.com/NetEase-Media/grps_trtllm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247526753,"owners_count":20953143,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agent","chatglm","deepseek-r1","function-call","internvideo","internvl2","janus-pro","llama-index","llama3","llm","minicpm-v","multi-modal","olmocr","openai","phi","qwen2","qwen2-vl","qwq","tensorrt-llm"],"created_at":"2024-09-24T21:38:39.595Z","updated_at":"2025-04-06T18:13:25.951Z","avatar_url":"https://github.com/NetEase-Media.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# grps-trtllm\n\n[GRPS](https://github.com/NetEase-Media/grps) + [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)\nprovide a pure ```C++``` implementation of an ```OpenAI 
LLM``` service that outperforms ```vllm serve``` and supports ```Chat```, ```Ai-agent```, ```Multi-modal```,\nmulti-GPU inference, and more.\n\n![GRPS](https://img.shields.io/badge/GRPS-blue)\n![TensorRT-LLM](https://img.shields.io/badge/TensorRT_LLM-green)\n![Tokenizer-CPP](https://img.shields.io/badge/Tokenizer_CPP-blue)\n![OpenAI](https://img.shields.io/badge/OpenAI-green)\n![Ai-Agent](https://img.shields.io/badge/Ai_Agent-blue)\n![Multi-Modal](https://img.shields.io/badge/Multi_Modal-green)\n\n[Quick Start](#快速开始) | [Model List](#模型列表) | [Image List](./docs/images.md) | [Performance](./docs/performance.md) | [Release Notes](./docs/release_note.md) | [Upcoming](./docs/next.md)\n\n\u003cdiv align=\"left\"\u003e\n\n## Demo\n\n\u003cimg src=\"docs/gradio.gif\" alt=\"gradio.gif\"\u003e\n\n## Overview\n\nThis project integrates [trtllm](https://github.com/NVIDIA/TensorRT-LLM) into [grps](https://github.com/NetEase-Media/grps)\nto provide a higher-performance ```LLM``` service that supports ```OpenAI```-style access, ```Ai-agent```, and multimodal\nmodels:\n\n* The complete ```LLM``` service is implemented in pure ```C++```, including the ```tokenizer``` (supporting `huggingface` and `sentencepiece` tokenizers), ```LLM inference```,\n  ```vit```, and other components.\n* The ```OpenAI``` interface protocol is implemented through the custom ```http``` feature of ```grps```, supporting ```chat``` and ```function call``` modes.\n* The ```prompt``` construction style and the parsing style for generated results can be extended per ```LLM```, enabling ```chat```\n  and ```function call``` modes for different ```LLM```s, with support for [llama-index](https://github.com/run-llama/llama_index) ```ai-agent```.\n* Multimodal ```LLM```s are supported by integrating the ```tensorrt``` inference backend and the ```opencv``` library.\n* ```TensorRT-LLM``` inference acceleration techniques such as ```inflight batching```, ```multi-gpu```, ```paged attention```, and ```kv-cache reuse``` are supported.\n* Compared with [triton tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend),\n  there is no inter-process communication between ```triton_server \u003c--\u003e tokenizer_backend \u003c--\u003e trtllm_backend```; the implementation is pure C++, which gives a consistent performance improvement.\n\nYou are welcome to use the project and open an [issue](https://github.com/NetEase-Media/grps_trtllm/issues).\n[PR](https://github.com/NetEase-Media/grps_trtllm/pulls)s that add support for new models are also welcome, and thanks for the star⭐️. You can also get in touch via WeChat: zhaocc1218.\n\n## Documentation\n\n* [Quick Start](#快速开始)\n* [Model List](#模型列表)\n* [Sampling parameter configuration](docs/sampling.md)\n* [Scheduling policy configuration](docs/scheduler.md)\n* [Prefix cache reuse](docs/kv_reuse.md)\n* [Starting the gradio service](docs/gradio.md)\n* [Docker deployment](docs/docker.md)\n* 
[Performance comparison](docs/performance.md)\n* [Image list](docs/images.md)\n* [Load testing](docs/benchmark.md)\n* [TODO](#todo)\n\n\u003ca id=\"模型列表\"\u003e\u003c/a\u003e\n\n## Model List\n\nSupported text LLMs:\n\n| supported model                                                          | llm_styler  | chat | function_call | doc                                                  |\n|--------------------------------------------------------------------------|-------------|------|---------------|------------------------------------------------------|\n| DeepSeek-R1-Distill\u003cbr\u003eTinyR1-32B-Preview                                | deepseek-r1 | ✅    | ❌             | [deepseek-r1-distill](docs%2Fdeepseek-r1-distill.md) |\n| QwQ-32B\u003cbr\u003eQwQ-32B-AWQ                                                   | qwq         | ✅    | ✅             | [qwq](docs%2Fqwq.md)                                 |\n| QwQ-32B-Preview                                                          | qwq-preview | ✅    | ❌             | [qwq-preview](docs%2Fqwq-preview.md)                 |\n| Qwen2.5-1M\u003cbr\u003eQwen2.5-Coder\u003cbr\u003eQwen2.5-Math\u003cbr\u003eQwen2.5                   | qwen2.5     | ✅    | ✅             | [qwen2.5](docs%2Fqwen2.5.md)                         |\n| Qwen1.5-Chat\u003cbr\u003eQwen1.5-Moe-Chat\u003cbr\u003eQwen2-Instruct\u003cbr\u003eQwen2-Moe-Instruct | qwen        | ✅    | ✅             | [qwen2](docs%2Fqwen2.md)                             |\n| chatglm3                                                                 | chatglm3    | ✅    | ✅             | [chatglm3](docs%2Fchatglm3.md)                       |\n| glm4                                                                     | glm4        | ✅    | ✅             | [glm4](docs%2Fglm4.md)                               |\n| internlm2_5-chat\u003cbr\u003einternlm2-chat                                       | internlm2   | ✅    | ✅             | [internlm2.5](docs%2Finternlm2.5.md)                 |\n| 
llama-3-instruct\u003cbr\u003ellama-3.1-instruct                                   | llama3      | ✅    | ❌             | [llama3](docs%2Fllama3.md)                           |\n| phi-4                                                                    | phi4        | ✅    | ❌             | [phi4](docs%2Fphi4.md)                               |\n| Phi-3, Phi-3.5                                                           | phi3        | ✅    | ❌             | [phi3](docs%2Fphi3.md)                               |\n| gemma-3(experimental)                                                    | gemma3      | ✅    | ❌             | [gemma-3](docs%2Fgemma3.md)                          |\n\nSupported multimodal LLMs (for a few models the vit cannot be implemented in pure C++):\n\n| supported model                               | llm_styler          | vit             | vit_type | chat | function_call | doc                                          |\n|-----------------------------------------------|---------------------|-----------------|----------|------|---------------|----------------------------------------------|\n| MiniCPM-V-2_6                                 | minicpmv            | minicpmv        | py       | ✅    | ❌             | [minicpmv](docs%2Fminicpmv.md)               |\n| Janus-Pro                                     | janus-pro           | janus-pro       | c++      | ✅    | ❌             | [janus-pro](docs%2Fjanus-pro.md)             |\n| InternVideo2.5                                | intern-video2.5     | intern-video2.5 | py       | ✅    | ❌             | [intern-video2.5](docs%2Fintern-video2.5.md) |\n| InternVL2_5\u003cbr\u003eInternVL2_5-MPO                | internvl2.5         | internvl2       | c++      | ✅    | ❌             | [internvl2.5](docs%2Finternvl2.5.md)         |\n| InternVL2-2B\u003cbr\u003eInternVL2-8B\u003cbr\u003eInternVL2-26B | internvl2-internlm2 | internvl2       | c++      | ✅    | ❌             | [internvl2](docs%2Finternvl2.md)             |\n| InternVL2-1B                            
      | internvl2-qwen2     | internvl2       | c++      | ✅    | ❌             | [internvl2](docs%2Finternvl2.md)             |\n| InternVL2-4B                                  | internvl2-phi3      | internvl2       | c++      | ✅    | ❌             | [internvl2](docs%2Finternvl2.md)             |\n| olmOCR                                        | qwen2vl             | qwen2vl         | c++      | ✅    | ❌             | [olm-ocr](docs%2Folm-ocr.md)                 |\n| Qwen2-VL-Instruct                             | qwen2vl             | qwen2vl         | c++      | ✅    | ❌             | [qwen2vl](docs%2Fqwen2vl.md)                 |\n| Qwen-VL-Chat\u003cbr\u003eQwen-VL                       | qwenvl              | qwenvl          | c++      | ✅    | ❌             | [qwenvl](docs%2Fqwenvl.md)                   |\n\n## Project Structure\n\n```text\n|-- client                              # client examples\n|-- conf                                # configuration files\n|   |-- inference*.yml                  # inference configs for the various llms\n|   |-- server.yml                      # server configuration\n|-- data                                # data files\n|-- docker                              # docker image build\n|-- docs                                # documentation\n|-- processors                          # remote processors\n|-- second_party                        # grps framework dependencies\n|-- src                                 # custom source code\n|   |-- tensorrt                        # tensorrt inference backend\n|   |-- vit                             # vit implementations\n|   |-- constants.cc/.h                 # constant definitions\n|   |-- customized_inferer.cc/.h        # custom inferer\n|   |-- llm_styler.cc/.h                # LLM style definitions: prompt building and result parsing\n|   |-- tokenizer.cc/.h                 # tokenizer implementation\n|   |-- trtllm_model_instance.cc/.h     # TensorRT-LLM model instance\n|   |-- trtllm_model_state.cc/.h        # TensorRT-LLM model state\n|   |-- utils.cc/.h                     # utilities\n|   |-- main.cc                         # local unit tests\n|-- third_party                         # third-party dependencies\n|-- tools                               # tools\n|-- build.sh        
                    # build script\n|-- CMakelists.txt                      # project build file\n|-- .clang-format                       # code formatting configuration file\n|-- .config                             # project configuration file, including some project configuration switches\n```\n\n\u003ca id=\"快速开始\"\u003e\u003c/a\u003e\n\n## Quick Start\n\nThe steps below use qwen2.5-instruct as an example. For other llms see the [Model List](#模型列表); the steps for cloning the code and creating the container are the same.\n\n### Clone the code\n\n```bash\ngit clone https://github.com/NetEase-Media/grps_trtllm.git\ncd grps_trtllm\ngit submodule update --init --recursive\n```\n\n### Create the container\n\nUse the ```registry.cn-hangzhou.aliyuncs.com/opengrps/grps_gpu:grps1.1.0_cuda12.6_cudnn9.6_trtllm0.16.0_py3.12``` image.\nThe current directory is mounted so the project can be built and the build artifacts kept, and /tmp is mounted to store the built trtllm engine files. Following ```triton-trtllm```,\nset the shared memory size, remove the limit on locked physical memory, and set the stack size with the flags\n```--shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864```.\n\n```bash\n# Create the container\ndocker run -itd --name grps_trtllm_dev --runtime=nvidia --network host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \\\n-v $(pwd):/grps_dev -v /tmp:/tmp -w /grps_dev \\\nregistry.cn-hangzhou.aliyuncs.com/opengrps/grps_gpu:grps1.1.0_cuda12.6_cudnn9.6_trtllm0.16.0_py3.12 bash\n# Enter the development container\ndocker exec -it grps_trtllm_dev bash\n```\n\n### Build the trtllm engine\n\n```bash\n# Download the Qwen2.5-7B-Instruct model\napt update \u0026\u0026 apt install git-lfs\ngit lfs install\ngit clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct /tmp/Qwen2.5-7B-Instruct\n\n# Enter the TensorRT-LLM/examples/qwen directory and build the trtllm engine following its README.\ncd third_party/TensorRT-LLM/examples/qwen\n# Convert the checkpoint\nrm -rf /tmp/Qwen2.5-7B-Instruct/tllm_checkpoint/\npython3 convert_checkpoint.py --model_dir /tmp/Qwen2.5-7B-Instruct \\\n--output_dir /tmp/Qwen2.5-7B-Instruct/tllm_checkpoint/ --dtype bfloat16 --load_model_on_cpu\n# Build the engine\nrm -rf /tmp/Qwen2.5-7B-Instruct/trt_engines/\ntrtllm-build --checkpoint_dir /tmp/Qwen2.5-7B-Instruct/tllm_checkpoint/ \\\n--output_dir /tmp/Qwen2.5-7B-Instruct/trt_engines/ \\\n--gemm_plugin bfloat16 --max_batch_size 16 --paged_kv_cache enable --use_paged_context_fmha enable \\\n--max_input_len 32256 --max_seq_len 32768 --max_num_tokens 32256\n# Run a test\npython3 ../run.py --input_text 
\"你好，你是谁？\" --max_output_len=50 \\\n--tokenizer_dir /tmp/Qwen2.5-7B-Instruct/ \\\n--engine_dir=/tmp/Qwen2.5-7B-Instruct/trt_engines/\n# Return to the project root\ncd ../../../../\n```\n\n### Modify the inference.yml configuration\n\nModify the ```inferer_args``` parameters in the conf/inference*.yml that corresponds to your llm. Make sure to update ```tokenizer_path```\nand ```gpt_model_path``` to the new paths. The other core parameters are shown below:\n\n```yaml\nmodels:\n  - name: trtllm_model\n    ...\n    inferer_args:\n      # llm style used to build the prompt (chat or function call) and parse the generated response for the openai interface.\n      # See README.md for the supported llm_style values.\n      llm_style: qwen2.5\n\n      # tokenizer config.\n      tokenizer_type: huggingface # can be `huggingface`, `sentencepiece`. Must be set.\n      tokenizer_path: /tmp/Qwen2.5-7B-Instruct/ # path of tokenizer. Must be set.\n      tokenizer_parallelism: 16 # tokenizers count for parallel tokenization. Will be set to 1 if not set.\n      end_token_id: 151645 # end token id of tokenizer. Null if not set.\n      pad_token_id: 151643 # pad token id of tokenizer. Null if not set.\n      skip_special_tokens: # skip special tokens when decoding. Empty if not set.\n        - 151643 # \"\u003c|endoftext|\u003e\"\n        - 151644 # \"\u003c|im_start|\u003e\"\n        - 151645 # \"\u003c|im_end|\u003e\"\n        ...\n      force_tokens_dict: # used to force-map tokens to ids when encoding and decoding instead of using the tokenizer. Empty if not set.\n      #  - token: \"\u003c|endoftext|\u003e\"\n      #    id: 151643\n      prefix_tokens_id: # prefix tokens id will be added to the beginning of the input ids. Empty if not set.\n      suffix_tokens_id: # suffix tokens id will be added to the end of the input ids. Empty if not set.\n\n      # default sampling config, sampling params in the request will override these. 
For supported sampling params see\n      # @ref(src/constants.h - SamplingConfig)\n      sampling:\n        top_k: 50\n        top_p: 1.0\n\n      # trtllm config.\n      gpt_model_type: inflight_fused_batching # must be `V1`(==`v1`) or `inflight_batching`(==`inflight_fused_batching`).\n      gpt_model_path: /tmp/Qwen2.5-7B-Instruct/trt_engines/ # path of decoder model. Must be set.\n      encoder_model_path: # path of encoder model. Null if not set.\n      stop_words: # additional stop words. Empty if not set.\n        - \"\u003c|im_start|\u003e\"\n        - \"\u003c|im_end|\u003e\"\n        - \"\u003c|endoftext|\u003e\"\n      bad_words: # additional bad words. Empty if not set.\n      batch_scheduler_policy: guaranteed_no_evict # must be `max_utilization` or `guaranteed_no_evict`.\n      kv_cache_free_gpu_mem_fraction: 0.9 # will be set to 0.9 or `max_tokens_in_paged_kv_cache` if not set.\n      exclude_input_in_output: true # will be set to false if not set.\n```\n\n### Build and deploy\n\n```bash\n# Build\ngrpst archive .\n\n# Deploy.\n# Start the service with --inference_conf pointing to the inference.yml of the model.\n# To change the service port, concurrency limit, etc., edit conf/server.yml and pass the new server.yml with --server_conf at startup.\n# Note: multi-GPU inference must be started via mpi; --mpi_np is the number of GPUs used for parallel inference.\ngrpst start ./server.mar --inference_conf=conf/inference_qwen2.5.yml\n\n# Check the service status\ngrpst ps\n# Example output:\nPORT(HTTP,RPC)      NAME                PID                 DEPLOY_PATH         \n9997                my_grps             65322               /home/appops/.grps/my_grps\n```\n\n### Sample requests\n\n```bash\n# Non-stream request with curl\ncurl --no-buffer http://127.0.0.1:9997/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"qwen2.5-instruct\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"你好，你是谁？\"\n      }\n    ]\n  }'\n# Response:\n: '\n{\n \"id\": \"chatcmpl-7\",\n \"object\": \"chat.completion\",\n \"created\": 1726733862,\n \"model\": \"qwen2.5-instruct\",\n \"system_fingerprint\": \"grps-trtllm-server\",\n 
\"choices\": [\n  {\n   \"index\": 0,\n   \"message\": {\n    \"role\": \"assistant\",\n    \"content\": \"你好！我是Qwen，由阿里云开发的人工智能模型。我被设计用来提供信息、回答问题和进行各种对话任务。有什么我可以帮助你的吗？\"\n   },\n   \"logprobs\": null,\n   \"finish_reason\": \"stop\"\n  }\n ],\n \"usage\": {\n  \"prompt_tokens\": 34,\n  \"completion_tokens\": 36,\n  \"total_tokens\": 70\n }\n}\n'\n\n# Stream request with curl\ncurl --no-buffer http://127.0.0.1:9997/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"qwen2.5-instruct\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"你好，你是谁？\"\n      }\n    ],\n    \"stream\": true\n  }'\n# Response:\n: '\ndata: {\"id\":\"chatcmpl-8\",\"object\":\"chat.completion.chunk\",\"created\":1726733878,\"model\":\"qwen2.5-instruct\",\"system_fingerprint\":\"grps-trtllm-server\",\"choices\":[{\"index\":0,\"delta\":{\"role\":\"assistant\",\"content\":\"你好\"},\"logprobs\":null,\"finish_reason\":null}]}\ndata: {\"id\":\"chatcmpl-8\",\"object\":\"chat.completion.chunk\",\"created\":1726733878,\"model\":\"qwen2.5-instruct\",\"system_fingerprint\":\"grps-trtllm-server\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"！\"},\"logprobs\":null,\"finish_reason\":null}]}\ndata: {\"id\":\"chatcmpl-8\",\"object\":\"chat.completion.chunk\",\"created\":1726733878,\"model\":\"qwen2.5-instruct\",\"system_fingerprint\":\"grps-trtllm-server\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"我是\"},\"logprobs\":null,\"finish_reason\":null}]}\n'\n\n# Test the stop parameter\ncurl --no-buffer http://127.0.0.1:9997/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"qwen2.5-instruct\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"重复1234#END#5678\"\n      }\n    ],\n    \"stop\": [\"#END#\"]\n  }'\n# Response:\n: '\n{\n \"id\": \"chatcmpl-2\",\n \"object\": \"chat.completion\",\n \"created\": 1727433345,\n \"model\": \"qwen2.5-instruct\",\n \"system_fingerprint\": 
\"grps-trtllm-server\",\n \"choices\": [\n  {\n   \"index\": 0,\n   \"message\": {\n    \"role\": \"assistant\",\n    \"content\": \"1234#END#\"\n   },\n   \"logprobs\": null,\n   \"finish_reason\": \"stop\"\n  }\n ],\n \"usage\": {\n  \"prompt_tokens\": 41,\n  \"completion_tokens\": 7,\n  \"total_tokens\": 48\n }\n}\n'\n\n# Non-stream request with openai_cli.py\npython3 client/openai_cli.py 127.0.0.1:9997 \"你好，你是谁？\" false\n# Response:\n: '\nChatCompletion(id='chatcmpl-9', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='你好！我是Qwen，由阿里云开发的人工智能模型。我被设计用来提供信息、回答问题和进行各种对话任务。有什么我可以帮助你的吗？', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1726733895, model='', object='chat.completion', service_tier=None, system_fingerprint='grps-trtllm-server', usage=CompletionUsage(completion_tokens=36, prompt_tokens=34, total_tokens=70, completion_tokens_details=None))\n'\n\n# Stream request with openai_cli.py\npython3 client/openai_cli.py 127.0.0.1:9997 \"你好，你是谁？\" true\n# Response:\n: '\nChatCompletionChunk(id='chatcmpl-10', choices=[Choice(delta=ChoiceDelta(content='你好', function_call=None, refusal=None, role='assistant', tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1726733914, model='', object='chat.completion.chunk', service_tier=None, system_fingerprint='grps-trtllm-server', usage=None)\nChatCompletionChunk(id='chatcmpl-10', choices=[Choice(delta=ChoiceDelta(content='！', function_call=None, refusal=None, role=None, tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1726733914, model='', object='chat.completion.chunk', service_tier=None, system_fingerprint='grps-trtllm-server', usage=None)\nChatCompletionChunk(id='chatcmpl-10', choices=[Choice(delta=ChoiceDelta(content='我是', function_call=None, refusal=None, role=None, tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1726733914, model='', object='chat.completion.chunk', service_tier=None, 
system_fingerprint='grps-trtllm-server', usage=None)\n'\n\n# 输入32k长文本小说验证长文本的支持\npython3 client/openai_txt_cli.py 127.0.0.1:9997 ./data/32k_novel.txt \"上面这篇小说作者是谁？\" false\n# 返回如下：\n: '\nChatCompletion(id='chatcmpl-11', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='这篇小说的作者是弦三千。', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1726733931, model='', object='chat.completion', service_tier=None, system_fingerprint='grps-trtllm-server', usage=CompletionUsage(completion_tokens=8, prompt_tokens=31615, total_tokens=31623, completion_tokens_details=None))\n'\n\n# 输入32k长文本小说进行总结\npython3 client/openai_txt_cli.py 127.0.0.1:9997 ./data/32k_novel.txt \"简述一下上面这篇小说的前几章内容。\" false\n# 返回如下：\n: '\nChatCompletion(id='chatcmpl-12', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='以下是《拜托，只想干饭的北极熊超酷的！》前几章的主要内容概述：\\n\\n1. **第一章**：楚云霁意外穿越成了一只北极熊，他发现了一群科考队，并用鱼与他们交流。楚云霁在暴风雪中艰难生存，通过抓鱼和捕猎海豹来获取食物。\\n\\n2. **第二章**：楚云霁在暴风雪后继续捕猎，遇到了一只北极白狼。白狼似乎对楚云霁很友好，甚至带他去捕猎海豹。楚云霁吃了一顿饱饭后，与白狼一起回到白狼的洞穴休息。\\n\\n3. **第三章**：楚云霁在白狼的洞穴中休息，醒来后发现白狼已经离开。他继续捕猎，遇到了一群海豹，但海豹很快被一只成年北极熊吓跑。楚云霁在冰面上发现了一群生蚝，但白狼对生蚝不感兴趣，楚云霁只好自己吃了。\\n\\n4. **第四章**：楚云霁在捕猎时遇到了一只成年北极熊，成年北极熊似乎在挑衅他。楚云霁和白狼一起捕猎了一只驯鹿，分享了食物。直播设备记录下了这一幕，引起了观众的热议。\\n\\n5. **第五章**：楚云霁和白狼一起捕猎了一只驯鹿，分享了食物。楚云霁在捕猎时遇到了一只北极狐，但北极狐被北极熊吓跑。楚云霁还遇到了一只海鸟，海鸟试图抢食，但被白狼赶走。楚云霁和白狼一起处理了一只驯鹿，白狼还帮助楚云霁取下了鹿角。\\n\\n6. **第六章**：楚云霁和白狼一起捕猎，楚云霁在冰面上睡觉时被冰面漂走。醒来后，楚云霁发现白狼还在身边，感到非常高兴。他们一起捕猎了一只海象，但海象偷走了鱼竿。楚云霁和白狼一起追捕海象，最终成功捕获了海象。\\n\\n7. **第七章**：楚云霁和白狼一起捕猎，楚云霁发现了一根鱼竿。他们一起用鱼竿钓鱼，但鱼竿被海象带走。楚云霁和白狼一起追捕海象，最终成功捕获了海象。楚云霁和白狼一起分享了海象肉。\\n\\n8. **第八章**：楚云霁和白狼一起捕猎，楚云霁发现了一根鱼竿。他们一起用鱼竿钓鱼，但鱼竿被海象带走。楚云霁和白狼一起追捕海象，最终成功捕获了海象。楚云霁和白狼一起分享了海象肉。\\n\\n9. **第九章**：楚云霁和白狼一起捕猎，楚云霁发现了一根鱼竿。他们一起用鱼竿钓鱼，但鱼竿被海象带走。楚云霁和白狼一起追捕海象，最终成功捕获了海象。楚云霁和白狼一起分享了海象肉。\\n\\n10. **第十章**：楚云霁和白狼一起捕猎，楚云霁发现了一根鱼竿。他们一起用鱼竿钓鱼，但鱼竿被海象带走。楚云霁和白狼一起追捕海象，最终成功捕获了海象。楚云霁和白狼一起分享了海象肉。\\n\\n11. 
**第十一章**：楚云霁在白狼的洞穴中发现了一个背包，背包里装满了各种食物和补给品。楚云霁和白狼一起分享了这些食物，包括罐头和海带。楚云霁还和白狼一起出去捕猎，但没有成功。\\n\\n12. **第十二章**：楚云霁和白狼一起出去捕猎，楚云霁发现了一根鱼竿。他们一起用鱼竿钓鱼，但鱼竿被海象带走。楚云霁和白狼一起追捕海象，最终成功捕获了海象。楚云霁和白狼一起分享了海象肉，并一起出去探索周围的环境。楚云霁还发现了一个背包，背包里装满了各种食物和补给品。楚云霁和白狼一起分享了这些食物，包括罐头和海带。楚云霁还和白狼一起出去捕猎，但没有成功。', refusal=None, role='assistant', function_call=None, tool_calls=None))], created=1726733966, model='', object='chat.completion', service_tier=None, system_fingerprint='grps-trtllm-server', usage=CompletionUsage(completion_tokens=959, prompt_tokens=31621, total_tokens=32580, completion_tokens_details=None))\n'\n\n# openai_func_call.py进行function call模拟\npython3 client/openai_func_call.py 127.0.0.1:9997\n# 返回如下：\n: '\nQuery server with question: What's the weather like in Boston today? ...\nServer response: thought: None, call local function(get_current_weather) with arguments: location=Boston, MA, unit=fahrenheit\nSend the result back to the server with function result(59.0) ...\nFinal server response: The current temperature in Boston today is 59°F.\n'\n\n# openai_func_call2.py进行一次两个函数的function call模拟\npython3 client/openai_func_call2.py 127.0.0.1:9997\n# 返回如下：\n: '\nQuery server with question: What's the postcode of Boston and what's the weather like in Boston today? ...\nServer response: thought: None, call local function(get_postcode) with arguments: location=Boston, MA\nServer response: thought: None, call local function(get_current_weather) with arguments: location=Boston, MA, unit=fahrenheit\nSend the result back to the server with function result ...\nFinal server response: The postcode for Boston, MA is 02138. 
The current temperature in Boston today is 59.0°F.\n'\n\n# llama-index ai agent example\npip install llama_index llama_index.llms.openai_like\npython3 client/llamaindex_ai_agent.py 127.0.0.1:9997\n# Response:\n: '\nQuery: What is the weather in Boston today?\nAdded user message to memory: What is the weather in Boston today?\n=== Calling Function ===\nCalling function: get_weather with args: {\"location\":\"Boston, MA\",\"unit\":\"fahrenheit\"}\nGot output: 59.0\n========================\n\nResponse: The current temperature in Boston is 59.0 degrees Fahrenheit.\n'\n```\n\n### Metrics\n\nService metrics can be viewed at ```http://ip:9997/```, for example:\n\n![metrics_0.png](docs/metrics_0.png)\u003cbr\u003e\n![metrics_1.png](docs/metrics_1.png)\n\n### Stop the service\n\n```bash\n# Stop the service\ngrpst stop my_grps\n```\n\n## TODO\n\n* The current implementation is based on versions after ```tensorrt-llm v0.10.0```, with the latest supported version being ```v0.16.0```\n  (main branch); see the repository branches for details. Because of limited manpower, some bugs cannot be fixed promptly on every branch, so please use the latest version branch where possible.\n* Because different ```LLM``` families build ```chat``` and ```function call```\n  ```prompt```s and parse results in different styles, a ```styler``` has to be implemented per ```LLM``` family; see ```src/llm_styler.cc/.h```.\n  Users can add their own. After extending, set ```llm_style``` in ```conf/inference.yml``` to the corresponding family name.\n  ```styler```s for more families are under continuous development.\n* The ```vit``` implementation differs between multimodal models; see ```src/vit```. Users can add their own. After extending, set ```vit_type``` in ```conf/inference.yml```\n  to the corresponding type name.\n  ```vit```s for more multimodal models are under continuous development.\n* Write developer documentation for user-defined ```llm_styler``` and ```vit``` extensions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetease-media%2Fgrps_trtllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnetease-media%2Fgrps_trtllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnetease-media%2Fgrps_trtllm/lists"}