{"id":19023672,"url":"https://github.com/neuralmagic/tensorrt-demo","last_synced_at":"2025-07-09T16:13:21.738Z","repository":{"id":241804752,"uuid":"807691575","full_name":"neuralmagic/tensorrt-demo","owner":"neuralmagic","description":null,"archived":false,"fork":false,"pushed_at":"2024-06-07T18:56:01.000Z","size":51,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-02-21T18:43:52.537Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neuralmagic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-29T15:33:41.000Z","updated_at":"2024-06-07T18:56:04.000Z","dependencies_parsed_at":"2024-06-10T08:18:36.245Z","dependency_job_id":null,"html_url":"https://github.com/neuralmagic/tensorrt-demo","commit_stats":null,"previous_names":["neuralmagic/tensorrt-demo"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/neuralmagic/tensorrt-demo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralmagic%2Ftensorrt-demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralmagic%2Ftensorrt-demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralmagic%2Ftensorrt-demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralmagic%2Ftensorrt-demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neuralmagic","download_u
rl":"https://codeload.github.com/neuralmagic/tensorrt-demo/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neuralmagic%2Ftensorrt-demo/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264492796,"owners_count":23617052,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T20:31:49.273Z","updated_at":"2025-07-09T16:13:21.669Z","avatar_url":"https://github.com/neuralmagic.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## TensorRT-Demo\n\nFirst, clone the [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo) repository:\n\n```bash\ngit clone git@github.com:neuralmagic/tensorrt-demo.git\ncd tensorrt-demo\nexport tensorrt_demo_dir=`pwd`\n\n```\n\nThen, clone the [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend) repository:\n\n```bash\ngit clone git@github.com:triton-inference-server/tensorrtllm_backend.git\ncd tensorrtllm_backend\nexport tensorrtllm_backend_dir=`pwd`\ngit lfs install\n```\n\nEnsure that the version of [tensorrtllm_backend](https://github.com/triton-inference-server/tensorrtllm_backend) is set to [r24.04](https://github.com/triton-inference-server/tensorrtllm_backend/tree/r24.04):\n\n```bash\ngit fetch --all\ngit checkout -b r24.04 -t origin/r24.04\n\ngit submodule update --init --recursive\n```\n\nCopy **triton_model_repo** directory from tensorrt-demo to tensorrtllm_backend: \n\n```bash\ncp -r ${tensorrt_demo_dir}/triton_model_repo 
${tensorrtllm_backend_dir}/\n```\n\nStart **trt-llm-triton** docker:\n\n```bash\nexport models_dir=$HOME/models\ndocker run -it -d --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --runtime=nvidia --gpus all -v ${tensorrtllm_backend_dir}:/tensorrtllm_backend  -v $HOME/models:/models -v ${tensorrt_demo_dir}:/root/tensorrt-demo --name triton_server nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 bash\n\ndocker exec -it triton_server /bin/bash\n```\n\nSet the model params. Modify *model_type* and *model_name* to point to your model, and adjust the model dtype, tp_size, max_batch_size and the other values to your requirements:\n\n```bash\nexport models_dir=/models\n\nexport model_type=llama\nexport model_name=Meta-Llama-3-70B-Instruct\n\nexport model_dtype=float16\nexport model_tp_size=2\n\nexport max_batch_size=256\nexport max_input_len=2048\nexport max_output_len=1024\n\nexport model_path=${models_dir}/${model_name}\nexport trt_model_path=${models_dir}/${model_name}-trt-ckpt\nexport trt_engine_path=${models_dir}/${model_name}-trt-engine\n```\n\nConvert the Hugging Face checkpoint to a TRT checkpoint:\n\n```bash\ncd /tensorrtllm_backend\ncd ./tensorrt_llm/examples/${model_type}\n\npython3 convert_checkpoint.py \\\n    --model_dir ${model_path} \\\n    --dtype ${model_dtype} \\\n    --tp_size ${model_tp_size} \\\n    --output_dir ${trt_model_path}\n```\n\nCompile the TRT checkpoint to a TRT engine:\n\n```bash\n# Choose to enable/disable chunked prompt: keep only one of the two exports below (if both run, the second one wins)\nexport CHUNKED_PROMPT_FLAGS=\nexport CHUNKED_PROMPT_FLAGS=\"--context_fmha=enable --use_paged_context_fmha=enable --context_fmha_fp32_acc=enable --multi_block_mode=enable\"\n\ntrtllm-build \\\n    --checkpoint_dir=${trt_model_path} \\\n    --gpt_attention_plugin=${model_dtype} \\\n    --gemm_plugin=${model_dtype} \\\n    --remove_input_padding=enable \\\n    --paged_kv_cache=enable \\\n    --tp_size=${model_tp_size} \\\n    --max_batch_size=${max_batch_size} \\\n    --max_input_len=${max_input_len} \\\n    
--max_output_len=${max_output_len} \\\n    --max_num_tokens=${max_output_len} \\\n    --opt_num_tokens=${max_output_len} \\\n    --output_dir=${trt_engine_path} \\\n    $CHUNKED_PROMPT_FLAGS\n\n```\n\nCopy the generated TRT engine to *triton_model_repo* as follows:\n\n```bash\ncd /tensorrtllm_backend/triton_model_repo\ncp -r ${trt_engine_path}/* ./tensorrt_llm/1\n```\n\nModify **triton_model_repo** config files as follows:\n\n1. Modify **ensemble/config.pbtxt**: \n\n| Param | Value |\n| ----- | ----- |\n| `max_batch_size` | Set to the value of **${max_batch_size}**  |\n\n2. Modify **preprocessing/config.pbtxt**: \n\n| Param | Value |\n| ----- | ----- |\n| `max_batch_size` | Set to the value of **${max_batch_size}**  |\n| `tokenizer_dir` | Set to the value of **${model_path}**  |\n\n3. Modify **postprocessing/config.pbtxt**: \n\n| Param | Value |\n| ----- | ----- |\n| `max_batch_size` | Set to the value of **${max_batch_size}**  |\n| `tokenizer_dir` | Set to the value of **${model_path}**  |\n\n4. Modify **tensorrt_llm/config.pbtxt**: \n\n| Param | Value |\n| ----- | ----- |\n| `max_batch_size` | Set to the value of **${max_batch_size}**  |\n| `decoupled` | Ensure it is set to **true** (to allow generate_stream)  |\n| `gpt_model_type` | Ensure it is using **inflight_fused_batching** to allow continuous batching of requests  |\n| `batch_scheduler_policy` | Ensure it is using **max_utilization** to batch requests as much as possible  |\n| `kv_cache_free_gpu_mem_fraction` | Ensure it is set to **0.9**. This value indicates the maximum fraction of GPU memory (after loading the model) that may be used for KV cache.  |\n\n\n5. 
Modify **tensorrt_llm_bls/config.pbtxt**: \n\n| Param | Value |\n| ----- | ----- |\n| `max_batch_size` | Set to the value of **${max_batch_size}**  |\n| `decoupled` | Ensure it is set to **true** (to allow generate_stream)  |\n\nStart Triton server:\n\n```bash\ncd /tensorrtllm_backend\npython3 scripts/launch_triton_server.py --world_size=${model_tp_size} --model_repo=/tensorrtllm_backend/triton_model_repo\n```\n\nEnsure that the triton-server is loaded correctly by checking that the model parts are in READY state, like in this output:\n\n```bash\nI0530 15:11:18.363912 56200 server.cc:677] \n+------------------+---------+--------+\n| Model            | Version | Status |\n+------------------+---------+--------+\n| ensemble         | 1       | READY  |\n| postprocessing   | 1       | READY  |\n| preprocessing    | 1       | READY  |\n| tensorrt_llm     | 1       | READY  |\n| tensorrt_llm_bls | 1       | READY  |\n+------------------+---------+--------+\n\nI0530 15:11:18.675865 56200 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA A100-SXM4-80GB\n```\n\nAt this point, triton-server is running inside the docker container, so we can exit the docker or go to another terminal to run the client.\n\nFor client benchmarking, we are using [benchmark_serving.py](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py) from the [vLLM](https://github.com/vllm-project/vllm) repository.\n\nFirst, clone the vLLM repository and install the package (a clean virtualenv is recommended here):\n\n```bash\ngit clone git@github.com:vllm-project/vllm.git\ncd vllm\nexport vllm_dir=`pwd`\npip install -e .\n```\n\nNow, we can run benchmark_serving.py to benchmark the triton-server:\n\n```bash\ncd ${vllm_dir}\ncd benchmarks\n\n# This is the same model from above\nexport model_name=Meta-Llama-3-70B-Instruct\n\n# Modify --sonnet-input-len, --sonnet-prefix-len, --sonnet-output-len and --request-rate based on your requirements \npython benchmark_serving.py --backend 
tensorrt-llm --endpoint /v2/models/ensemble/generate_stream  --host 0.0.0.0 --port 8000 --model $HOME/models/${model_name} --num-prompts 100 --save-result --dataset-name sonnet --dataset-path sonnet.txt --sonnet-input-len 512 --sonnet-prefix-len 256 --sonnet-output-len 256 --request-rate 1\n```\n\nTo run a vLLM server, first match its **--gpu-memory-utilization** parameter to Triton's **kv_cache_free_gpu_mem_fraction**. Above, we set **kv_cache_free_gpu_mem_fraction=0.9**, but this is not equivalent to vLLM's default **--gpu-memory-utilization=0.9**: Triton's parameter is a fraction of the GPU memory that remains **after loading the model**, whereas vLLM's is a fraction of the GPU memory before the model is loaded. The equivalent --gpu-memory-utilization for vLLM is therefore *((GPU_TOTAL_MEMORY - MODEL_MEMORY) \\* 0.9 + MODEL_MEMORY) / GPU_TOTAL_MEMORY*. For Llama 3 70B FP16 with *MODEL_MEMORY=68296MB* and an A100 GPU with *GPU_TOTAL_MEMORY=81920MB*, we get *((81920-68296)\\*0.9 + 68296) / 81920 = 0.9833*, so we use **--gpu-memory-utilization=0.9833** in this case.\n\n```bash\ncd ${vllm_dir}\n\n# These are the same model params from above (that were used inside the docker container)\nexport model_name=Meta-Llama-3-70B-Instruct\nexport model_tp_size=2\nexport model_dtype=float16\nexport max_input_len=2048\nexport vllm_gpu_memory_utilization=0.9833 \n\n# Run server\npython3 vllm/entrypoints/openai/api_server.py --model $HOME/models/${model_name} --max-model-len ${max_input_len} --disable-log-requests --enforce-eager --tensor-parallel-size ${model_tp_size} --dtype=${model_dtype} --port 8888 --gpu-memory-utilization ${vllm_gpu_memory_utilization}\n```\n\nRun benchmark_serving.py to benchmark the vllm-server:\n\n```bash\ncd ${vllm_dir}\ncd benchmarks\n\nexport model_name=Meta-Llama-3-70B-Instruct\n\n# Modify --sonnet-input-len, --sonnet-prefix-len, --sonnet-output-len and --request-rate based on your requirements \npython 
benchmark_serving.py --backend vllm --host localhost --port 8888 --endpoint /v1/completions --model $HOME/models/${model_name} --num-prompts 100 --save-result --dataset-name sonnet --dataset-path sonnet.txt --sonnet-input-len 512 --sonnet-prefix-len 256 --sonnet-output-len 256 --request-rate 1\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuralmagic%2Ftensorrt-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneuralmagic%2Ftensorrt-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneuralmagic%2Ftensorrt-demo/lists"}