{"id":31871748,"url":"https://github.com/triple-mu/qwen-image-tensorrt","last_synced_at":"2025-10-12T20:59:35.418Z","repository":{"id":313606502,"uuid":"1051963839","full_name":"triple-Mu/Qwen-Image-TensorRT","owner":"triple-Mu","description":"Qwen-Image's DiT inference with TensorRT-10","archived":false,"fork":false,"pushed_at":"2025-09-07T07:56:58.000Z","size":13,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-07T09:25:07.040Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/triple-Mu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-07T05:18:27.000Z","updated_at":"2025-09-07T07:57:34.000Z","dependencies_parsed_at":"2025-09-07T18:45:50.393Z","dependency_job_id":null,"html_url":"https://github.com/triple-Mu/Qwen-Image-TensorRT","commit_stats":null,"previous_names":["triple-mu/qwen-image-tensorrt"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/triple-Mu/Qwen-Image-TensorRT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triple-Mu%2FQwen-Image-TensorRT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triple-Mu%2FQwen-Image-TensorRT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triple-Mu%2FQwen-Image-TensorRT/releases","manifests_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triple-Mu%2FQwen-Image-TensorRT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/triple-Mu","download_url":"https://codeload.github.com/triple-Mu/Qwen-Image-TensorRT/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/triple-Mu%2FQwen-Image-TensorRT/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279012812,"owners_count":26085191,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-12T02:00:06.719Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-12T20:59:34.168Z","updated_at":"2025-10-12T20:59:35.410Z","avatar_url":"https://github.com/triple-Mu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Qwen-Image-TensorRT\n\nQwen-Image's DiT inference with TensorRT-10\n\n## ENV\n\nThe project was tested in the following environment:\n\n- Ubuntu 18.04\n- NVIDIA Driver 525.125.06\n- [`CUDA`](https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run)\n  11.8\n- [`Python`](https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-py39_25.7.0-2-Linux-x86_64.sh) 3.10.18\n- [\n  
`PyTorch`](https://download.pytorch.org/whl/cu118/torch-2.6.0%2Bcu118-cp310-cp310-linux_x86_64.whl#sha256=715d3b039a629881f263c40d1fb65edac6786da13bfba221b353ef2371c4da86)\n  2.6.0+cu118\n- [`Diffusers`](https://github.com/huggingface/diffusers/commit/fc337d585309c4b032e8d0180bea683007219df1) 0.36.0.dev0\n- [\n  `ONNX`](https://files.pythonhosted.org/packages/79/21/9bcc715ea6d9aab3f6c583bfc59504a14777e39e0591030e7345f4e40315/onnx-1.19.0-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl)\n  1.19.0\n- [\n  `TensorRT`](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.13.0/tars/TensorRT-10.13.0.35.Linux.x86_64-gnu.cuda-11.8.tar.gz)\n  10.13.0.35\n- [`cudnn-frontend`](https://github.com/NVIDIA/cudnn-frontend/commit/1a7b4b78db44712fb9707d21cd2e3179f1fd88b8) 1.14.1\n\n```shell\n# Create conda env\nconda create -n qwen-image python=3.10\nconda activate qwen-image\n\n# Install PyTorch\npip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118\n# Install Diffusers\npip install git+https://github.com/huggingface/diffusers.git@fc337d585309c4b032e8d0180bea683007219df1\n# Install ONNX\npip install onnx==1.19.0\n\n# Install TensorRT\nwget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.13.0/tars/TensorRT-10.13.0.35.Linux.x86_64-gnu.cuda-11.8.tar.gz\ntar -xf TensorRT-10.13.0.35.Linux.x86_64-gnu.cuda-11.8.tar.gz\npip install TensorRT-10.13.0.35/python/tensorrt-10.13.0.35-cp310-none-linux_x86_64.whl\nexport PATH=${PWD}/TensorRT-10.13.0.35/bin:$PATH\nexport LD_LIBRARY_PATH=${PWD}/TensorRT-10.13.0.35/lib:$LD_LIBRARY_PATH\n\n# Install cudnn-frontend\n# tensorrt-plugin is coming soon\n```\n\n## CONVERT TO ONNX\n\nClone the project first:\n\n```shell\ngit clone https://github.com/triple-Mu/Qwen-Image-TensorRT.git\ncd Qwen-Image-TensorRT\n```\n\nThe following scripts test exporting the model to ONNX step by step:\n\n- 
[`1-export-dit-directly.py`](./step_by_step/1-export-dit-directly.py)\n\n```shell\npython step_by_step/1-export-dit-directly.py --model_path Qwen/Qwen-Image --onnx_path transformer_step1.onnx\n```\n\nThis script makes almost no modifications to the model, so the export fails with the following error:\n\n```text\n  File \"/root/anaconda3/envs/qwen-image/lib/python3.10/site-packages/torch/onnx/_internal/jit_utils.py\", line 308, in _create_node\n    _C._jit_pass_onnx_node_shape_type_inference(node, params_dict, opset_version)\nRuntimeError: ScalarType ComplexFloat is an unexpected tensor scalar type\n```\n\nSince ONNX does not support complex operators, proceed to step 2.\n\n- [`2-remove-complex-op.py`](./step_by_step/2-remove-complex-op.py)\n\n```shell\npython step_by_step/2-remove-complex-op.py --model_path Qwen/Qwen-Image --onnx_path transformer_step2.onnx\n```\n\nAfter removing `self.pos_embed` and replacing `apply_rotary_emb_qwen`, the export succeeds.\n\n- [`3-merge-qkv-projection.py`](./step_by_step/3-merge-qkv-projection.py)\n\n```shell\npython step_by_step/3-merge-qkv-projection.py --model_path Qwen/Qwen-Image --onnx_path transformer_step3.onnx\n```\n\nAdvanced: Merging the Q, K, and V projections into a single GEMM reduces kernel launches and increases throughput.\n\n- [`4-cudnn-attention-plugin.py`](./step_by_step/4-cudnn-attention-plugin.py)\n\n```shell\npython step_by_step/4-cudnn-attention-plugin.py --model_path Qwen/Qwen-Image --onnx_path transformer_step4.onnx\n```\n\n*COMING SOON!*\n\nAdvanced: Replacing SDPA with cudnn-attention results in a significant performance improvement on A100 GPUs.\n\n## CONVERT TO TensorRT\n\nAfter converting `QwenImageTransformer2DModel` to ONNX, the TensorRT engine can be built with `trtexec`.\n\nRefer to [`2-build_engine.sh`](./scripts/2-build_engine.sh)\n\nSet `TENSORRT_ROOT`, `ONNX_PATH`, and `ENGINE_PATH` first; the min/opt/max shapes can also be adjusted as needed.\n\nThen run:\n\n```shell\nbash scripts/2-build_engine.sh\n```\n\nThe following log output will be 
shown:\n\n```text\n[09/07/2025-21:42:26] [I] === Trace details ===\n[09/07/2025-21:42:26] [I] Trace averages of 10 runs:\n[09/07/2025-21:42:26] [I] Average on 10 runs - GPU latency: 1666.2 ms - Host latency: 1666.9 ms (enqueue 1663.95 ms)\n[09/07/2025-21:42:26] [I] \n[09/07/2025-21:42:26] [I] === Performance summary ===\n[09/07/2025-21:42:26] [I] Throughput: 0.562059 qps\n[09/07/2025-21:42:26] [I] Latency: min = 1656.22 ms, max = 1674.64 ms, mean = 1666.9 ms, median = 1667.89 ms, percentile(90%) = 1673.26 ms, percentile(95%) = 1674.64 ms, percentile(99%) = 1674.64 ms\n[09/07/2025-21:42:26] [I] Enqueue Time: min = 1650.99 ms, max = 1672.49 ms, mean = 1663.95 ms, median = 1663.63 ms, percentile(90%) = 1672.08 ms, percentile(95%) = 1672.49 ms, percentile(99%) = 1672.49 ms\n[09/07/2025-21:42:26] [I] H2D Latency: min = 0.631348 ms, max = 0.640015 ms, mean = 0.635217 ms, median = 0.635742 ms, percentile(90%) = 0.63623 ms, percentile(95%) = 0.640015 ms, percentile(99%) = 0.640015 ms\n[09/07/2025-21:42:26] [I] GPU Compute Time: min = 1655.52 ms, max = 1673.94 ms, mean = 1666.2 ms, median = 1667.19 ms, percentile(90%) = 1672.56 ms, percentile(95%) = 1673.94 ms, percentile(99%) = 1673.94 ms\n[09/07/2025-21:42:26] [I] D2H Latency: min = 0.0585938 ms, max = 0.0664062 ms, mean = 0.0639648 ms, median = 0.0644531 ms, percentile(90%) = 0.0654297 ms, percentile(95%) = 0.0664062 ms, percentile(99%) = 0.0664062 ms\n[09/07/2025-21:42:26] [I] Total Host Walltime: 17.7917 s\n[09/07/2025-21:42:26] [I] Total GPU Compute Time: 16.662 s\n[09/07/2025-21:42:26] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.\n[09/07/2025-21:42:26] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.\n[09/07/2025-21:42:26] [I] Explanations of the performance metrics are printed in the verbose logs.\n[09/07/2025-21:42:26] [I] \n\u0026\u0026\u0026\u0026 PASSED TensorRT.trtexec [TensorRT v101300] 
[b35] # trtexec --onnx=transformer_step2.onnx --saveEngine=transformer_step2.plan --bf16 --optShapes=hidden_states:1x6032x64,encoder_hidden_states:1x128x3584,timestep:1,img_rope_real:6032x64,img_rope_imag:6032x64,txt_rope_real:128x64,txt_rope_imag:128x64 --minShapes=hidden_states:1x3364x64,encoder_hidden_states:1x1x3584,timestep:1,img_rope_real:3364x64,img_rope_imag:3364x64,txt_rope_real:1x64,txt_rope_imag:1x64 --maxShapes=hidden_states:1x10816x64,encoder_hidden_states:1x1024x3584,timestep:1,img_rope_real:10816x64,img_rope_imag:10816x64,txt_rope_real:1024x64,txt_rope_imag:1024x64 --shapes=hidden_states:1x10816x64,encoder_hidden_states:1x1024x3584,timestep:1,img_rope_real:10816x64,img_rope_imag:10816x64,txt_rope_real:1024x64,txt_rope_imag:1024x64\n```\n\n## RUNNING TensorRT Pipeline!\n\nAfter converting the ONNX model to an engine, the full pipeline can be built on top of the Diffusers pipeline.\n\nRefer to [`run_trt_pipeline.py`](./run_trt_pipeline.py)\n\nRun:\n\n```shell\npython run_trt_pipeline.py --model_path Qwen/Qwen-Image --trt_path transformer_step2.engine\n```\n\nThe example output image will then be saved to [`example.png`](./example.png).\n\n## CUDNN-ATTENTION Plugin!\n\n*COMING SOON!*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftriple-mu%2Fqwen-image-tensorrt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftriple-mu%2Fqwen-image-tensorrt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftriple-mu%2Fqwen-image-tensorrt/lists"}