{"id":23345702,"url":"https://github.com/triton-inference-server/tensorrtllm_backend","last_synced_at":"2025-05-15T04:04:25.425Z","repository":{"id":201701413,"uuid":"690665848","full_name":"triton-inference-server/tensorrtllm_backend","owner":"triton-inference-server","description":"The Triton TensorRT-LLM Backend","archived":false,"fork":false,"pushed_at":"2025-05-14T18:38:02.000Z","size":1504,"stargazers_count":833,"open_issues_count":331,"forks_count":122,"subscribers_count":22,"default_branch":"main","last_synced_at":"2025-05-14T19:30:50.320Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/triton-inference-server.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-09-12T16:18:28.000Z","updated_at":"2025-05-13T10:09:40.000Z","dependencies_parsed_at":"2023-12-15T13:46:56.614Z","dependency_job_id":"4fc965d2-f2fe-4842-af3c-37be5ed27fb4","html_url":"https://github.com/triton-inference-server/tensorrtllm_backend","commit_stats":null,"previous_names":["triton-inference-server/tensorrtllm_backend"],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triton-inference-server%2Ftensorrtllm_backend","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triton-inference-server%2Ftensorrtllm_backend/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triton-inference-server%2Ftensorrtllm_backend/releases","
manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triton-inference-server%2Ftensorrtllm_backend/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/triton-inference-server","download_url":"https://codeload.github.com/triton-inference-server/tensorrtllm_backend/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254270641,"owners_count":22042858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-21T07:01:22.539Z","updated_at":"2025-05-15T04:04:25.398Z","avatar_url":"https://github.com/triton-inference-server.png","language":"Python","funding_links":[],"categories":["Frameworks","Summary","A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"\u003c!--\n# Copyright 2024-2025, NVIDIA CORPORATION \u0026 AFFILIATES. 
All rights reserved.\n#\n# Redistribution and use in source and binary forms, with or without\n# modification, are permitted provided that the following conditions\n# are met:\n#  * Redistributions of source code must retain the above copyright\n#    notice, this list of conditions and the following disclaimer.\n#  * Redistributions in binary form must reproduce the above copyright\n#    notice, this list of conditions and the following disclaimer in the\n#    documentation and/or other materials provided with the distribution.\n#  * Neither the name of NVIDIA CORPORATION nor the names of its\n#    contributors may be used to endorse or promote products derived\n#    from this software without specific prior written permission.\n#\n# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY\n# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\n# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\n# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR\n# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,\n# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,\n# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR\n# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY\n# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\n# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\n# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n--\u003e\n\n# TensorRT-LLM Backend\nThe Triton backend for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).\nYou can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend).\nThe goal of TensorRT-LLM Backend is to let you serve [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)\nmodels with Triton Inference Server. 
The [inflight_batcher_llm](./inflight_batcher_llm/)\ndirectory contains the C++ implementation of the backend supporting inflight\nbatching, paged attention and more.\n\nWhere can I ask general questions about Triton and Triton backends?\nBe sure to read all the information below as well as the [general\nTriton documentation](https://github.com/triton-inference-server/server#triton-inference-server)\navailable in the main [server](https://github.com/triton-inference-server/server)\nrepo. If you don't find your answer there you can ask questions on the\n[issues page](https://github.com/triton-inference-server/tensorrtllm_backend/issues).\n\n## Table of Contents\n- [TensorRT-LLM Backend](#tensorrt-llm-backend)\n  - [Table of Contents](#table-of-contents)\n  - [Getting Started](#getting-started)\n    - [Quick Start](#quick-start)\n      - [Launch Triton TensorRT-LLM container](#launch-triton-tensorrt-llm-container)\n      - [Prepare TensorRT-LLM engines](#prepare-tensorrt-llm-engines)\n      - [Prepare the Model Repository](#prepare-the-model-repository)\n      - [Modify the Model Configuration](#modify-the-model-configuration)\n      - [Serving with Triton](#serving-with-triton)\n      - [Send an Inference Request](#send-an-inference-request)\n        - [Using the generate endpoint](#using-the-generate-endpoint)\n        - [Using the client scripts](#using-the-client-scripts)\n          - [Early stopping](#early-stopping)\n          - [Return context logits and/or generation logits](#return-context-logits-andor-generation-logits)\n        - [Requests with batch size \\\u003e 1](#requests-with-batch-size--1)\n  - [Building from Source](#building-from-source)\n  - [Supported Models](#supported-models)\n  - [Model Config](#model-config)\n  - [Model Deployment](#model-deployment)\n    - [TRT-LLM Multi-instance Support](#trt-llm-multi-instance-support)\n      - [Leader Mode](#leader-mode)\n      - [Orchestrator Mode](#orchestrator-mode)\n      - [Running Multiple 
Instances of LLaMa Model](#running-multiple-instances-of-llama-model)\n    - [Multi-node Support](#multi-node-support)\n    - [Model Parallelism](#model-parallelism)\n      - [Tensor Parallelism, Pipeline Parallelism and Expert Parallelism](#tensor-parallelism-pipeline-parallelism-and-expert-parallelism)\n    - [MIG Support](#mig-support)\n    - [Scheduling](#scheduling)\n    - [Key-Value Cache](#key-value-cache)\n    - [Decoding](#decoding)\n      - [Decoding Modes - Top-k, Top-p, Top-k Top-p, Beam Search, Medusa, ReDrafter, Lookahead and Eagle](#decoding-modes---top-k-top-p-top-k-top-p-beam-search-medusa-redrafter-lookahead-and-eagle)\n      - [Speculative Decoding](#speculative-decoding)\n    - [Chunked Context](#chunked-context)\n    - [Quantization](#quantization)\n    - [LoRa](#lora)\n  - [Launch Triton server *within Slurm based clusters*](#launch-triton-server-within-slurm-based-clusters)\n    - [Prepare some scripts](#prepare-some-scripts)\n    - [Submit a Slurm job](#submit-a-slurm-job)\n  - [Triton Metrics](#triton-metrics)\n  - [Benchmarking](#benchmarking)\n  - [Testing the TensorRT-LLM Backend](#testing-the-tensorrt-llm-backend)\n\n## Getting Started\n\n### Quick Start\n\nBelow is an example of how to serve a TensorRT-LLM model with the Triton\nTensorRT-LLM Backend on a 4-GPU environment. The example uses the GPT model from\nthe\n[TensorRT-LLM repository](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.11.0/examples/gpt)\nwith the\n[NGC Triton TensorRT-LLM container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).\nMake sure you are cloning the same version of TensorRT-LLM backend as the\nversion of TensorRT-LLM in the container. 
Please refer to the\n[support matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)\nto see the aligned versions.\n\nIn this example, we will use Triton 24.07 with TensorRT-LLM v0.11.0.\n\n\n#### Launch Triton TensorRT-LLM container\n\nLaunch the Triton docker container `nvcr.io/nvidia/tritonserver:\u003cxx.yy\u003e-trtllm-python-py3`\nwith the TensorRT-LLM backend.\n\nMake an `engines` folder outside docker to reuse engines for future runs. Make\nsure to replace the `\u003cxx.yy\u003e` with the version of Triton that you want to use.\n\n```bash\ndocker run --rm -it --net host --shm-size=2g \\\n    --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \\\n    -v \u003c/path/to/engines\u003e:/engines \\\n    nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3\n```\n\n#### Prepare TensorRT-LLM engines\n\nYou can skip this step if you already have the engines ready.\nFollow the [guide](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) in\nthe TensorRT-LLM repository for more details on how to prepare the engines for\nall the supported models. 
You can also check out the\n[tutorials](https://github.com/triton-inference-server/tutorials) to see more\nexamples of serving TensorRT-LLM models.\n\n```bash\ncd /app/tensorrt_llm/examples/models/core/gpt\n\n# Download weights from HuggingFace Transformers\nrm -rf gpt2 \u0026\u0026 git clone https://huggingface.co/gpt2-medium gpt2\npushd gpt2 \u0026\u0026 rm pytorch_model.bin model.safetensors \u0026\u0026 wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin \u0026\u0026 popd\n\n# Convert weights from HF Transformers to TensorRT-LLM checkpoint\npython3 convert_checkpoint.py --model_dir gpt2 \\\n        --dtype float16 \\\n        --tp_size 4 \\\n        --output_dir ./c-model/gpt2/fp16/4-gpu\n\n# Build TensorRT engines\ntrtllm-build --checkpoint_dir ./c-model/gpt2/fp16/4-gpu \\\n        --gpt_attention_plugin float16 \\\n        --remove_input_padding enable \\\n        --kv_cache_type paged \\\n        --gemm_plugin float16 \\\n        --output_dir /engines/gpt/fp16/4-gpu\n```\n\nSee [here](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gpt) for\nmore details on the parameters.\n\n#### Prepare the Model Repository\n\nNext, create the\n[model repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)\nthat will be used by the Triton server. The models can be found in the\n[all_models](./all_models) folder. 
The folder contains two groups of models:\n- [`gpt`](./all_models/gpt): Using the TensorRT-LLM pure Python runtime.\n- [`inflight_batcher_llm`](./all_models/inflight_batcher_llm/): Using the C++\nTensorRT-LLM backend with the executor API, which provides the latest features,\nincluding inflight batching.\n\nThere are five models in\n[all_models/inflight_batcher_llm](./all_models/inflight_batcher_llm) that will\nbe used in this example:\n\n| Model | Description |\n| :------------: | :---------------: |\n| `ensemble` | This model is used to chain the preprocessing, tensorrt_llm and postprocessing models together. |\n| `preprocessing` | This model is used for tokenizing, meaning the conversion from prompts (string) to input_ids (list of ints). |\n| `tensorrt_llm` | This model is a wrapper of your TensorRT-LLM model and is used for inferencing. Input specification can be found [here](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/inference-request.md) |\n| `postprocessing` | This model is used for de-tokenizing, meaning the conversion from output_ids (list of ints) to outputs (string). |\n| `tensorrt_llm_bls` | This model can also be used to chain the preprocessing, tensorrt_llm and postprocessing models together. |\n\nTo learn more about ensemble and BLS models, please see the\n[Ensemble Models](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models)\nand\n[Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting)\ndocumentation.\n\nTo learn more about the benefits and the limitations of using the BLS model,\nplease see the [model config](./docs/model_config.md#tensorrt_llm_bls-model) section.\n\n```bash\nmkdir /triton_model_repo\ncp -r /app/all_models/inflight_batcher_llm/* /triton_model_repo/\n```\n\n#### Modify the Model Configuration\nUse the `fill_template.py` script to fill in the parameters in the model configuration files. 
For\noptimal performance or custom parameters, please refer to\n[perf_best_practices](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md).\nFor more details on the model configuration and the parameters that can be\nmodified, please refer to the [model config](./docs/model_config.md) section.\n\n```bash\nENGINE_DIR=/engines/gpt/fp16/4-gpu\nTOKENIZER_DIR=/app/tensorrt_llm/examples/models/core/gpt/gpt2\nMODEL_FOLDER=/triton_model_repo\nTRITON_MAX_BATCH_SIZE=4\nINSTANCE_COUNT=1\nMAX_QUEUE_DELAY_MS=0\nMAX_QUEUE_SIZE=0\nFILL_TEMPLATE_SCRIPT=/app/tools/fill_template.py\nDECOUPLED_MODE=false\nLOGITS_DATATYPE=TYPE_FP32\n\npython3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}\npython3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}\npython3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}\npython3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}\npython3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}\n```\n\n\u003e **NOTE**:\nIt is recommended to match the number of pre/post_instance_counts with triton_max_batch_size 
for better performance.\n\n#### Serving with Triton\n\nNow, you're ready to launch the Triton server with the TensorRT-LLM model.\n\nUse the launch_triton_server.py script. This launches multiple instances of tritonserver with MPI.\n\n```bash\n# 'world_size' is the number of GPUs you want to use for serving. This should\n# be aligned with the number of GPUs used to build the TensorRT-LLM engine.\npython3 /app/scripts/launch_triton_server.py --world_size=4 --model_repo=${MODEL_FOLDER}\n```\n\nYou should see the following logs when the server is successfully deployed.\n\n```bash\n...\nI0503 22:01:25.210518 1175 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:8001\nI0503 22:01:25.211612 1175 http_server.cc:4692] Started HTTPService at 0.0.0.0:8000\nI0503 22:01:25.254914 1175 http_server.cc:362] Started Metrics Service at 0.0.0.0:8002\n```\n\nTo stop Triton Server inside the container, run:\n\n```bash\npkill tritonserver\n```\n\n#### Send an Inference Request\n\n##### Using the [generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)\n\nThe general format of the generate endpoint:\n```bash\ncurl -X POST localhost:8000/v2/models/${MODEL_NAME}/generate -d '{\"{PARAM1_KEY}\": \"{PARAM1_VALUE}\", ... }'\n```\n\nIn the case of the models used in this example, you can replace MODEL_NAME with\n`ensemble` or `tensorrt_llm_bls`. 
Examining the ensemble and tensorrt_llm_bls\nmodel's config.pbtxt file, you can see that 4 parameters are required to\ngenerate a response for this model:\n\n- text_input: Input text to generate a response from\n- max_tokens: The number of requested output tokens\n- bad_words: A list of bad words (can be empty)\n- stop_words: A list of stop words (can be empty)\n\nTherefore, we can query the server in the following way:\n\n- if using the ensemble model\n```bash\ncurl -X POST localhost:8000/v2/models/ensemble/generate -d '{\"text_input\": \"What is machine learning?\", \"max_tokens\": 20, \"bad_words\": \"\", \"stop_words\": \"\"}'\n```\n\n- if using the tensorrt_llm_bls model\n\n```bash\ncurl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{\"text_input\": \"What is machine learning?\", \"max_tokens\": 20, \"bad_words\": \"\", \"stop_words\": \"\"}'\n```\n\nWhich should return a result similar to (formatted for readability):\n```bash\n{\n  \"model_name\": \"ensemble\",\n  \"model_version\": \"1\",\n  \"sequence_end\": false,\n  \"sequence_id\": 0,\n  \"sequence_start\": false,\n  \"text_output\": \"What is machine learning?\\n\\nMachine learning is a method of learning by using machine learning algorithms to solve problems.\\n\\n\"\n}\n```\n\n##### Using the client scripts\n\nYou can refer to the client scripts in the\n[inflight_batcher_llm/client](./inflight_batcher_llm/client) to see how to send\nrequests via Python scripts.\n\nBelow is an example of using\n[inflight_batcher_llm_client](./inflight_batcher_llm/client/inflight_batcher_llm_client.py)\nto send requests to the `tensorrt_llm` model.\n\n```bash\npip3 install tritonclient[all]\nINFLIGHT_BATCHER_LLM_CLIENT=/app/inflight_batcher_llm/client/inflight_batcher_llm_client.py\npython3 ${INFLIGHT_BATCHER_LLM_CLIENT} --request-output-len 200 --tokenizer-dir ${TOKENIZER_DIR}\n```\n\nThe result should be similar to the following:\n\n```bash\nUsing pad_id:  50256\nUsing end_id:  50256\nInput sequence:  
[28524, 287, 5093, 12, 23316, 4881, 11, 30022, 263, 8776, 355, 257]\nGot completed request\nInput: Born in north-east France, Soyer trained as a\nOutput beam 0:  chef before moving to London in the early 1990s. He has since worked in restaurants in London, Paris, Milan and New York.\n\nHe is married to the former model and actress, Anna-Marie, and has two children, a daughter, Emma, and a son, Daniel.\n\nSoyer's wife, Anna-Marie, is a former model and actress.\n\nHe is survived by his wife, Anna-Marie, and their two children, Daniel and Emma.\n\nSoyer was born in the north-east of France, and moved to London in the early 1990s.\n\nHe was a chef at the London restaurant, The Bistro, before moving to New York in the early 2000s.\n\nHe was a regular at the restaurant, and was also a regular at the restaurant, The Bistro, before moving to London in the early 2000s.\n\nSoyer was a regular at the restaurant, and was\nOutput sequence:  [28524, 287, 5093, 12, 23316, 4881, 11, 30022, 263, 8776, 355, 257, 21221, 878, 3867, 284, 3576, 287, 262, 1903, 6303, 82, 13, 679, 468, 1201, 3111, 287, 10808, 287, 3576, 11, 6342, 11, 21574, 290, 968, 1971, 13, 198, 198, 1544, 318, 6405, 284, 262, 1966, 2746, 290, 14549, 11, 11735, 12, 44507, 11, 290, 468, 734, 1751, 11, 257, 4957, 11, 18966, 11, 290, 257, 3367, 11, 7806, 13, 198, 198, 50, 726, 263, 338, 3656, 11, 11735, 12, 44507, 11, 318, 257, 1966, 2746, 290, 14549, 13, 198, 198, 1544, 318, 11803, 416, 465, 3656, 11, 11735, 12, 44507, 11, 290, 511, 734, 1751, 11, 7806, 290, 18966, 13, 198, 198, 50, 726, 263, 373, 4642, 287, 262, 5093, 12, 23316, 286, 4881, 11, 290, 3888, 284, 3576, 287, 262, 1903, 6303, 82, 13, 198, 198, 1544, 373, 257, 21221, 379, 262, 3576, 7072, 11, 383, 347, 396, 305, 11, 878, 3867, 284, 968, 1971, 287, 262, 1903, 4751, 82, 13, 198, 198, 1544, 373, 257, 3218, 379, 262, 7072, 11, 290, 373, 635, 257, 3218, 379, 262, 7072, 11, 383, 347, 396, 305, 11, 878, 3867, 284, 3576, 287, 262, 1903, 4751, 82, 13, 198, 198, 50, 
726, 263, 373, 257, 3218, 379, 262, 7072, 11, 290, 373]\n```\n\n###### Early stopping\n\nYou can also stop the generation process early by using the `--stop-after-ms`\noption to send a stop request after a few milliseconds:\n\n```bash\npython3 ${INFLIGHT_BATCHER_LLM_CLIENT} --stop-after-ms 200 --request-output-len 200 --request-id 1 --tokenizer-dir ${TOKENIZER_DIR}\n```\n\nYou will find that the generation process is stopped early and therefore the\nnumber of generated tokens is lower than 200. You can have a look at the\nclient code to see how early stopping is achieved.\n\n###### Return context logits and/or generation logits\n\nIf you want to get context logits and/or generation logits, you need to enable\n`--gather_context_logits` and/or `--gather_generation_logits` when building the\nengine (or `--gather_all_token_logits` to enable both at the same time). For\nmore setting details about these two flags, please refer to\n[build.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/commands/build.py)\nor\n[gpt_runtime](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-runtime.md).\n\nAfter launching the server, you could get the output of logits by passing the\ncorresponding parameters `--return-context-logits` and/or\n`--return-generation-logits` in the client scripts\n([end_to_end_grpc_client.py](./inflight_batcher_llm/client/end_to_end_grpc_client.py)\nand\n[inflight_batcher_llm_client.py](./inflight_batcher_llm/client/inflight_batcher_llm_client.py)).\n\nFor example:\n\n```bash\npython3 ${INFLIGHT_BATCHER_LLM_CLIENT} --request-output-len 20 --tokenizer-dir ${TOKENIZER_DIR} --return-context-logits --return-generation-logits\n```\n\nThe result should be similar to the following:\n\n```bash\nInput sequence:  [28524, 287, 5093, 12, 23316, 4881, 11, 30022, 263, 8776, 355, 257]\nGot completed request\nInput: Born in north-east France, Soyer trained as a\nOutput beam 0:  has since worked in restaurants in London,\nOutput 
sequence:  [21221, 878, 3867, 284, 3576, 287, 262, 1903, 6303, 82, 13, 679, 468, 1201, 3111, 287, 10808, 287, 3576, 11]\ncontext_logits.shape: (1, 12, 50257)\ncontext_logits: [[[ -65.9822     -62.267445   -70.08991   ...  -76.16964    -78.8893\n    -65.90678  ]\n  [-103.40278   -102.55243   -106.119026  ... -108.925415  -109.408585\n   -101.37687  ]\n  [ -63.971176   -64.03466    -67.58809   ...  -72.141235   -71.16892\n    -64.23846  ]\n  ...\n  [ -80.776375   -79.1815     -85.50916   ...  -87.07368    -88.02817\n    -79.28435  ]\n  [ -10.551408    -7.786484   -14.524468  ...  -13.805856   -15.767286\n     -7.9322424]\n  [-106.33096   -105.58956   -111.44852   ... -111.04858   -111.994194\n   -105.40376  ]]]\ngeneration_logits.shape: (1, 1, 20, 50257)\ngeneration_logits: [[[[-106.33096  -105.58956  -111.44852  ... -111.04858  -111.994194\n    -105.40376 ]\n   [ -77.867424  -76.96638   -83.119095 ...  -87.82542   -88.53957\n     -75.64877 ]\n   [-136.92282  -135.02484  -140.96051  ... -141.78284  -141.55045\n    -136.01668 ]\n   ...\n   [-100.03721   -98.98237  -105.25507  ... -108.49254  -109.45882\n     -98.95136 ]\n   [-136.78777  -136.16165  -139.13437  ... -142.21495  -143.57468\n    -134.94667 ]\n   [  19.222942   19.127287   14.804495 ...   10.556551    9.685863\n      19.625107]]]]\n```\n\n##### Requests with batch size \u003e 1\n\nThe TRT-LLM backend supports requests with batch size greater than one. When\nsending a request with a batch size greater than one, the TRT-LLM backend will\nreturn multiple batch size 1 responses, where each response will be associated\nwith a given batch index. An output tensor named `batch_index` is associated\nwith each response to indicate which batch index this response corresponds to.\n\nThe client script\n[end_to_end_grpc_client.py](./inflight_batcher_llm/client/end_to_end_grpc_client.py)\ndemonstrates how a client can send requests with batch size \u003e 1 and consume the\nresponses returned from Triton. 
When passing `--batch-inputs` to the client\nscript, the client will create a request with multiple prompts, and use the\n`batch_index` output tensor to associate the responses to the original prompt.\nFor example one could run:\n\n```\npython3 /app/inflight_batcher_llm/client/end_to_end_grpc_client.py -o 5 -p '[\"This is a test\",\"I want you to\",\"The cat is\"]'  --batch-inputs\n```\n\nto send a request with a batch size of 3 to the Triton server.\n\n## Building from Source\n\nPlease refer to the [build.md](./docs/build.md) for more details on how to\nbuild the Triton TRT-LLM container from source.\n\n## Supported Models\n\nOnly a few examples are listed here. For all the supported models, please refer\nto the [support matrix](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html).\n\n- LLaMa\n  - [End to end workflow to run llama 7b with Triton](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md)\n  - [Build and run a LLaMA model in TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama)\n  - [Llama Multi-instance](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md)\n  - [Deploying Hugging Face Llama2-7b Model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Popular_Models_Guide/Llama2/trtllm_guide.md#infer-with-tensorrt-llm-backend)\n\n- Gemma\n  - [End to end workflow to run sp model with Triton](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/gemma.md)\n  - [Run Gemma on TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/gemma)\n\n- Mistral\n  - [Build and run a Mixtral model in TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/mixtral/README.md)\n\n- Multi-modal\n  - [End to end workflow to run multimodal models(e.g. 
BLIP2-OPT, LLava1.5-7B, VILA) with Triton](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/multimodal.md)\n  - [Deploying Hugging Face Llava1.5-7b Model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Popular_Models_Guide/Llava1.5/llava_trtllm_guide.md)\n\n- Encoder-Decoder\n  - [End to end workflow to run an Encoder-Decoder model with Triton](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/encoder_decoder.md)\n\n## Model Config\n\nPlease refer to the [model config](./docs/model_config.md) for more details on\nthe model configuration.\n\n## Model Deployment\n\n### TRT-LLM Multi-instance Support\n\nTensorRT-LLM backend relies on MPI to coordinate the execution of a model across\nmultiple GPUs and nodes. Currently, there are two different modes supported to\nrun a model across multiple GPUs, **Leader Mode** and **Orchestrator Mode**.\n\n\u003e **Note**: This is different from the model multi-instance support from Triton\n\u003e Server which allows multiple instances of a model to be run on the same or\n\u003e different GPUs. For more information on Triton Server multi-instance support,\n\u003e please refer to the\n\u003e [Triton model config documentation](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups).\n\n#### Leader Mode\n\nIn leader mode, TensorRT-LLM backend spawns one Triton Server process for every\nGPU. The process with rank 0 is the leader process. 
The other Triton Server processes\ndo not return from the `TRITONBACKEND_ModelInstanceInitialize` call, which avoids\nport collisions and keeps them from receiving requests directly.\n\nThe overview of this mode is described in the diagram below:\n\n![Leader Mode Overview](./images/leader-mode.png)\n\nThis mode works well with [slurm](https://slurm.schedmd.com) deployments since\nit doesn't use\n[MPI_Comm_spawn](https://www.open-mpi.org/doc/v4.1/man3/MPI_Comm_spawn.3.php).\n\n#### Orchestrator Mode\n\nIn orchestrator mode, the TensorRT-LLM backend spawns a single Triton Server\nprocess that acts as an orchestrator and spawns one Triton Server process for\nevery GPU that each model requires. This mode is mainly used when serving\nmultiple models with the TensorRT-LLM backend. In this mode, the `MPI` world size\nmust be one, as the TRT-LLM backend will automatically create new workers as needed.\nThe overview of this mode is described in the diagram below:\n\n![Orchestrator Mode Overview](./images/orchestrator-mode.png)\n\nSince this mode uses\n[MPI_Comm_spawn](https://www.open-mpi.org/doc/v4.1/man3/MPI_Comm_spawn.3.php),\nit might not work properly with [slurm](https://slurm.schedmd.com) deployments.\nAdditionally, this currently only works for single-node deployments.\n\n#### Running Multiple Instances of LLaMa Model\n\nPlease refer to\n[Running Multiple Instances of the LLaMa Model](docs/llama_multi_instance.md)\nfor more information on running multiple instances of the LLaMa model in different\nconfigurations.\n\n### Multi-node Support\n\nCheck out the\n[Multi-Node Generative AI w/ Triton Server and TensorRT-LLM](https://github.com/triton-inference-server/tutorials/tree/main/Deployment/Kubernetes/TensorRT-LLM_Multi-Node_Distributed_Models)\ntutorial for Triton Server and TensorRT-LLM multi-node deployment.\n\n### Model Parallelism\n\n#### Tensor Parallelism, Pipeline Parallelism and Expert Parallelism\n\n[Tensor 
Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#tensor-parallelism),\n[Pipeline Parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#pipeline-parallelism)\nand\n[Expert parallelism](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html#expert-parallelism)\nare supported in TensorRT-LLM.\n\nSee the models in the\n[examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) folder for\nmore details on how to build the engines with tensor parallelism, pipeline\nparallelism and expert parallelism.\n\nSome examples are shown below:\n\n- Build LLaMA v3 70B using 4-way tensor parallelism and 2-way pipeline parallelism.\n\n```bash\npython3 convert_checkpoint.py --model_dir ./tmp/llama/70B/hf/ \\\n                            --output_dir ./tllm_checkpoint_8gpu_tp4_pp2 \\\n                            --dtype float16 \\\n                            --tp_size 4 \\\n                            --pp_size 2\n\ntrtllm-build --checkpoint_dir ./tllm_checkpoint_8gpu_tp4_pp2 \\\n            --output_dir ./tmp/llama/70B/trt_engines/fp16/8-gpu/ \\\n            --gemm_plugin auto\n```\n\n- Build Mixtral8x22B with tensor parallelism and expert parallelism\n\n```bash\npython3 ../llama/convert_checkpoint.py --model_dir ./Mixtral-8x22B-v0.1 \\\n                             --output_dir ./tllm_checkpoint_mixtral_8gpu \\\n                             --dtype float16 \\\n                             --tp_size 8 \\\n                             --moe_tp_size 2 \\\n                             --moe_ep_size 4\ntrtllm-build --checkpoint_dir ./tllm_checkpoint_mixtral_8gpu \\\n                 --output_dir ./trt_engines/mixtral/tp2ep4 \\\n                 --gemm_plugin float16\n```\n\nSee the\n[doc](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/expert-parallelism.md)\nto learn more about how TensorRT-LLM expert 
parallelism works in Mixture of Experts (MoE).\n\n### MIG Support\n\nSee the\n[MIG tutorial](https://github.com/triton-inference-server/tutorials/tree/main/Deployment/Kubernetes)\nfor more details on how to run TRT-LLM models and Triton with MIG.\n\n### Scheduling\n\nThe scheduler policy helps the batch manager adjust how requests are scheduled\nfor execution. TensorRT-LLM supports two scheduler policies,\n`MAX_UTILIZATION` and `GUARANTEED_NO_EVICT`. See the\n[batch manager design](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/batch-manager.md#gptmanager-design)\nto learn more about how scheduler policies work. You can specify the scheduler\npolicy via the `batch_scheduler_policy` parameter in the\n[model config](./docs/model_config.md#tensorrt_llm_model) of the tensorrt_llm model.\n\n### Key-Value Cache\n\nSee the\n[KV Cache](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-attention.md#kv-cache)\nsection for more details on how TensorRT-LLM supports KV cache. Also, check out\nthe [KV Cache Reuse](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/kv_cache_reuse.md)\ndocumentation to learn more about how to enable KV cache reuse when building the\nTRT-LLM engine. Parameters for KV cache can be found in the\n[model config](./docs/model_config.md#tensorrt_llm_model) of the tensorrt_llm model.\n\n### Decoding\n\n#### Decoding Modes - Top-k, Top-p, Top-k Top-p, Beam Search, Medusa, ReDrafter, Lookahead and Eagle\n\nTensorRT-LLM supports various decoding modes, including top-k, top-p,\ntop-k top-p, beam search, Medusa, ReDrafter, Lookahead and Eagle. 
See the\n[Sampling Parameters](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-runtime.md#sampling-parameters)\nsection to learn more about top-k, top-p, top-k top-p and beam search decoding.\nPlease refer to the\n[speculative decoding documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/speculative-decoding.md)\nfor more details on Medusa, ReDrafter, Lookahead and Eagle.\n\nParameters for decoding modes can be found in the\n[model config](./docs/model_config.md#tensorrt_llm_model) of tensorrt_llm model.\n\n#### Speculative Decoding\n\nSee the\n[Speculative Decoding](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/speculative_decoding.md)\ndocumentation to learn more about how TensorRT-LLM supports speculative decoding\nto improve the performance. The parameters for speculative decoding can be found\nin the [model config](./docs/model_config.md#tensorrt_llm_bls_model) of\ntensorrt_llm_bls model.\n\n### Chunked Context\n\nFor more details on how to use chunked context, please refer to the\n[Chunked Context](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/gpt-attention.md#chunked-context)\nsection. Parameters for chunked context can be found in the\n[model config](./docs/model_config.md#tensorrt_llm_model) of tensorrt_llm model.\n\n### Quantization\n\nCheck out the\n[Quantization Guide](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/README.md)\nto learn more about how to install the quantization toolkit and quantize\nTensorRT-LLM models. 
Also, check out the blog post\n[Speed up inference with SOTA quantization techniques in TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/quantization-in-TRT-LLM.md)\nto learn more about how to speed up inference with quantization.\n\n### LoRA\n\nRefer to [lora.md](./docs/lora.md) for more details on how to use LoRA\nwith TensorRT-LLM and Triton.\n\n## Launch Triton server *within Slurm based clusters*\n\n### Prepare some scripts\n\n`tensorrt_llm_triton.sub`\n```bash\n#!/bin/bash\n#SBATCH -o logs/tensorrt_llm.out\n#SBATCH -e logs/tensorrt_llm.error\n#SBATCH -J \u003cREPLACE WITH YOUR JOB's NAME\u003e\n#SBATCH -A \u003cREPLACE WITH YOUR ACCOUNT's NAME\u003e\n#SBATCH -p \u003cREPLACE WITH YOUR PARTITION's NAME\u003e\n#SBATCH --nodes=1\n#SBATCH --ntasks-per-node=8\n#SBATCH --time=00:30:00\n\nsudo nvidia-smi -lgc 1410,1410\n\nsrun --mpi=pmix \\\n    --container-image triton_trt_llm \\\n    --container-workdir /tensorrtllm_backend \\\n    --output logs/tensorrt_llm_%t.out \\\n    bash /tensorrtllm_backend/tensorrt_llm_triton.sh\n```\n\n`tensorrt_llm_triton.sh`\n```bash\nTRITONSERVER=\"/opt/tritonserver/bin/tritonserver\"\nMODEL_REPO=\"/triton_model_repo\"\n\n${TRITONSERVER} --model-repository=${MODEL_REPO} --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix${SLURM_PROCID}_\n```\n\nIf srun initializes the MPI environment, you can use the following command to launch the Triton server:\n\n```bash\nsrun --mpi pmix launch_triton_server.py --oversubscribe\n```\n\n### Submit a Slurm job\n\n```bash\nsbatch tensorrt_llm_triton.sub\n```\n\nYou might have to contact your cluster's administrator to help you customize the above scripts.\n\n## Triton Metrics\n\nStarting with the 23.11 release of Triton, users can obtain TRT LLM Batch\nManager [statistics](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/batch-manager.md#statistics)\nby querying the Triton metrics endpoint. 
This can be accomplished by launching\na Triton server in any of the ways described above (ensuring the build code /\ncontainer is 23.11 or later) and querying the server. Upon receiving a\nsuccessful response, you can query the metrics endpoint by entering the\nfollowing:\n\n```bash\ncurl localhost:8002/metrics\n```\n\nBatch manager statistics are reported by the metrics endpoint in fields that\nare prefixed with `nv_trt_llm_`. Your output for these fields should look\nsimilar to the following (assuming your model is an inflight batcher model):\n\n```bash\n# HELP nv_trt_llm_request_metrics TRT LLM request metrics\n# TYPE nv_trt_llm_request_metrics gauge\nnv_trt_llm_request_metrics{model=\"tensorrt_llm\",request_type=\"waiting\",version=\"1\"} 1\nnv_trt_llm_request_metrics{model=\"tensorrt_llm\",request_type=\"context\",version=\"1\"} 1\nnv_trt_llm_request_metrics{model=\"tensorrt_llm\",request_type=\"scheduled\",version=\"1\"} 1\nnv_trt_llm_request_metrics{model=\"tensorrt_llm\",request_type=\"max\",version=\"1\"} 512\nnv_trt_llm_request_metrics{model=\"tensorrt_llm\",request_type=\"active\",version=\"1\"} 0\n# HELP nv_trt_llm_runtime_memory_metrics TRT LLM runtime memory metrics\n# TYPE nv_trt_llm_runtime_memory_metrics gauge\nnv_trt_llm_runtime_memory_metrics{memory_type=\"pinned\",model=\"tensorrt_llm\",version=\"1\"} 0\nnv_trt_llm_runtime_memory_metrics{memory_type=\"gpu\",model=\"tensorrt_llm\",version=\"1\"} 1610236\nnv_trt_llm_runtime_memory_metrics{memory_type=\"cpu\",model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_trt_llm_kv_cache_block_metrics TRT LLM KV cache block metrics\n# TYPE nv_trt_llm_kv_cache_block_metrics gauge\nnv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"fraction\",model=\"tensorrt_llm\",version=\"1\"} 0.4875\nnv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"tokens_per\",model=\"tensorrt_llm\",version=\"1\"} 64\nnv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"used\",model=\"tensorrt_llm\",version=\"1\"} 
1\nnv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"free\",model=\"tensorrt_llm\",version=\"1\"} 6239\nnv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=\"max\",model=\"tensorrt_llm\",version=\"1\"} 6239\n# HELP nv_trt_llm_inflight_batcher_metrics TRT LLM inflight_batcher-specific metrics\n# TYPE nv_trt_llm_inflight_batcher_metrics gauge\nnv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric=\"micro_batch_id\",model=\"tensorrt_llm\",version=\"1\"} 0\nnv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric=\"generation_requests\",model=\"tensorrt_llm\",version=\"1\"} 0\nnv_trt_llm_inflight_batcher_metrics{inflight_batcher_specific_metric=\"total_context_tokens\",model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_trt_llm_general_metrics General TRT LLM metrics\n# TYPE nv_trt_llm_general_metrics gauge\nnv_trt_llm_general_metrics{general_type=\"iteration_counter\",model=\"tensorrt_llm\",version=\"1\"} 0\nnv_trt_llm_general_metrics{general_type=\"timestamp\",model=\"tensorrt_llm\",version=\"1\"} 1700074049\n# HELP nv_trt_llm_disaggregated_serving_metrics TRT LLM disaggregated serving metrics\n# TYPE nv_trt_llm_disaggregated_serving_metrics counter\nnv_trt_llm_disaggregated_serving_metrics{disaggregated_serving_type=\"kv_cache_transfer_ms\",model=\"tensorrt_llm\",version=\"1\"} 0\nnv_trt_llm_disaggregated_serving_metrics{disaggregated_serving_type=\"request_count\",model=\"tensorrt_llm\",version=\"1\"} 0\n```\n\nIf, instead, you launched a V1 model, your output will look similar to the\noutput above except the inflight batcher related fields will be replaced\nwith something similar to the following:\n\n```bash\n# HELP nv_trt_llm_v1_metrics TRT LLM v1-specific metrics\n# TYPE nv_trt_llm_v1_metrics gauge\nnv_trt_llm_v1_metrics{model=\"tensorrt_llm\",v1_specific_metric=\"total_generation_tokens\",version=\"1\"} 20\nnv_trt_llm_v1_metrics{model=\"tensorrt_llm\",v1_specific_metric=\"empty_generation_slots\",version=\"1\"} 
0\nnv_trt_llm_v1_metrics{model=\"tensorrt_llm\",v1_specific_metric=\"total_context_tokens\",version=\"1\"} 5\n```\n\nPlease note that versions of Triton prior to the 23.12 release do not\nsupport base Triton metrics. As such, the following fields will report 0:\n\n```bash\n# HELP nv_inference_request_success Number of successful inference requests, all batch sizes\n# TYPE nv_inference_request_success counter\nnv_inference_request_success{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes\n# TYPE nv_inference_request_failure counter\nnv_inference_request_failure{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_count Number of inferences performed (does not include cached requests)\n# TYPE nv_inference_count counter\nnv_inference_count{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)\n# TYPE nv_inference_exec_count counter\nnv_inference_exec_count{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)\n# TYPE nv_inference_request_duration_us counter\nnv_inference_request_duration_us{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_queue_duration_us Cumulative inference queuing duration in microseconds (includes cached requests)\n# TYPE nv_inference_queue_duration_us counter\nnv_inference_queue_duration_us{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_compute_input_duration_us Cumulative compute input duration in microseconds (does not include cached requests)\n# TYPE nv_inference_compute_input_duration_us counter\nnv_inference_compute_input_duration_us{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_compute_infer_duration_us Cumulative compute inference duration in microseconds (does not include cached requests)\n# TYPE 
nv_inference_compute_infer_duration_us counter\nnv_inference_compute_infer_duration_us{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_compute_output_duration_us Cumulative inference compute output duration in microseconds (does not include cached requests)\n# TYPE nv_inference_compute_output_duration_us counter\nnv_inference_compute_output_duration_us{model=\"tensorrt_llm\",version=\"1\"} 0\n# HELP nv_inference_pending_request_count Instantaneous number of pending requests awaiting execution per-model.\n# TYPE nv_inference_pending_request_count gauge\nnv_inference_pending_request_count{model=\"tensorrt_llm\",version=\"1\"} 0\n```\n\n## Benchmarking\n\nCheck out the [GenAI-Perf](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf)\ntool for benchmarking TensorRT-LLM models.\n\nYou can also use the\n[benchmark_core_model script](./tools/inflight_batcher_llm/benchmark_core_model.py)\nto benchmark the core model `tensorrt_llm`. The script sends requests directly\nto the deployed `tensorrt_llm` model. The benchmark core model latency indicates the\ninference latency of TensorRT-LLM, not including the pre/post-processing latency,\nwhich is usually handled by a third-party library such as HuggingFace.\n\nbenchmark_core_model can generate traffic from 2 sources:\n\n1. dataset (JSON file containing prompts and optional responses)\n2. token normal distribution (user-specified input and output seqlen)\n\nBy default, an exponential distribution is used to control the arrival rate of requests.\nIt can be changed to a constant arrival time.\n\n```bash\ncd tools/inflight_batcher_llm\n```\n\nExample: Run a dataset at a requested rate of 10 req/sec with the provided tokenizer.\n\n```bash\npython3 benchmark_core_model.py -i grpc --request_rate 10 dataset --dataset \u003cdataset path\u003e --tokenizer_dir \u003c\u003e --num_requests 5000\n```\n\nExample: Generate I/O seqlen tokens with input normal distribution with mean_seqlen=128, stdev=5. 
Output normal distribution with mean_seqlen=20, stdev=2. Set stdev=0 to get constant seqlens.\n\n```bash\npython3 benchmark_core_model.py -i grpc --request_rate 10 token_norm_dist --input_mean 128 --input_stdev 5 --output_mean 20 --output_stdev 2 --num_requests 5000\n```\n\nExpected output:\n\n```bash\n[INFO] Warm up for benchmarking.\n[INFO] Start benchmarking on 5000 prompts.\n[INFO] Total Latency: 26585.349 ms\n[INFO] Total request latencies: 11569672.000999955 ms\n+----------------------------+----------+\n|            Stat            |  Value   |\n+----------------------------+----------+\n|        Requests/Sec        |  188.09  |\n|       OP tokens/sec        | 3857.66  |\n|     Avg. latency (ms)      | 2313.93  |\n|      P99 latency (ms)      | 3624.95  |\n|      P90 latency (ms)      | 3127.75  |\n| Avg. IP tokens per request |  128.53  |\n| Avg. OP tokens per request |  20.51   |\n|     Total latency (ms)     | 26582.72 |\n|       Total requests       | 5000.00  |\n+----------------------------+----------+\n\n```\n*Please note that the expected output in this document is only for reference; specific performance numbers depend on the GPU you're using.*\n\n## Testing the TensorRT-LLM Backend\nPlease follow the guide in [`ci/README.md`](ci/README.md) to see how to run\nthe tests for the TensorRT-LLM backend.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftriton-inference-server%2Ftensorrtllm_backend","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftriton-inference-server%2Ftensorrtllm_backend","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftriton-inference-server%2Ftensorrtllm_backend/lists"}