{"id":15029579,"url":"https://github.com/shouxieai/tensorrt_pro","last_synced_at":"2025-12-13T22:11:02.519Z","repository":{"id":37656376,"uuid":"389495808","full_name":"shouxieai/tensorRT_Pro","owner":"shouxieai","description":"C++ library based on tensorrt integration","archived":false,"fork":false,"pushed_at":"2023-05-24T05:27:21.000Z","size":114334,"stargazers_count":2722,"open_issues_count":95,"forks_count":560,"subscribers_count":34,"default_branch":"main","last_synced_at":"2025-04-05T00:02:03.122Z","etag":null,"topics":["deep-learning","object-detection","pytorch","tensorrt","yolov5","yolox"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shouxieai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-07-26T03:20:10.000Z","updated_at":"2025-04-03T06:42:57.000Z","dependencies_parsed_at":"2024-01-14T12:28:27.979Z","dependency_job_id":null,"html_url":"https://github.com/shouxieai/tensorRT_Pro","commit_stats":null,"previous_names":["shouxieai/tensorrt_cpp"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shouxieai%2FtensorRT_Pro","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shouxieai%2FtensorRT_Pro/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shouxieai%2FtensorRT_Pro/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shouxieai%2FtensorRT_Pro/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shouxieai","download_url":"https://codeload.github.com/shouxieai/tensorRT_Pro/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248509963,"owners_count":21116125,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","object-detection","pytorch","tensorrt","yolov5","yolox"],"created_at":"2024-09-24T20:11:05.862Z","updated_at":"2025-12-13T22:10:57.809Z","avatar_url":"https://github.com/shouxieai.png","language":"C++","readme":"*Read this in other languages: [English](README.md), [简体中文](tutorial/README.zh-cn.md).*\n\n## News: \n- 🔥 A simple implementation is released: https://github.com/shouxieai/infer\n- 🔥 Add yolov7 support .\n- 🔥 Released python solution for hardware decoding with tensorRT integration\n- 🔥 Docker Image has been released：https://hub.docker.com/r/hopef/tensorrt-pro\n- ⚡tensorRT_Pro_comments_version(co-contributing version) is also provided for a better learning experience. Repo: https://github.com/Guanbin-Huang/tensorRT_Pro_comments\n- 🔥 [Simple yolov5/yolox implemention is released. Simple and easy to use.](example-simple_yolo)\n- 🔥 yolov5-1.0-6.0/master are supported.\n- Tutorial notebooks download:\n  - [WarpAffine.lesson.tar.gz](http://zifuture.com:1000/fs/25.shared/warpaffine.lesson.tar.gz)\n  - [Offset.tar.gz](http://zifuture.com:1000/fs/25.shared/offset.tar.gz)\n- Tutorial for exporting CenterNet from pytorch to tensorRT is released. \n\n## Tutorial Video\n\n- \u003cb\u003eblibli\u003c/b\u003e : https://www.bilibili.com/video/BV1Xw411f7FW (Now only in Chinese. English is comming)\n- \u003cb\u003eslides\u003c/b\u003e : http://zifuture.com:1556/fs/sxai/tensorRT.pptx (Now only in Chinese. English is comming)\n- \u003cb\u003etutorial folder\u003c/b\u003e: a good intro for beginner to get a general idea of our framework.(Chinese/English)\n\n## An Out-of-the-Box TensorRT-based Framework for High Performance Inference with C++/Python Support\n\n- C++ Interface: 3 lines of code is all you need to run a YoloX\n\n  ```C++\n  // create inference engine on gpu-0\n  //auto engine = Yolo::create_infer(\"yolov5m.fp32.trtmodel\", Yolo::Type::V5, 0);\n  auto engine = Yolo::create_infer(\"yolox_m.fp32.trtmodel\", Yolo::Type::X, 0);\n  \n  // load image\n  auto image = cv::imread(\"1.jpg\");\n  \n  // do inference and get the result\n  auto box = engine-\u003ecommit(image).get();  // return vector\u003cBox\u003e\n  ```\n\n- Python Interface:\n  ```python\n  import pytrt\n  \n  model     = models.resnet18(True).eval().to(device)\n  trt_model = tp.from_torch(model, input)\n  trt_out   = trt_model(input)\n  ```\n  \n  - simple yolo for python\n  ```python\n  import os\n  import cv2\n  import numpy as np\n  import pytrt as tp\n\n  engine_file = \"yolov5s.fp32.trtmodel\"\n  if not os.path.exists(engine_file):\n      tp.compile_onnx_to_file(1, tp.onnx_hub(\"yolov5s\"), engine_file)\n\n  yolo   = tp.Yolo(engine_file, type=tp.YoloType.V5)\n  image  = cv2.imread(\"car.jpg\")\n  bboxes = yolo.commit(image).get()\n  print(f\"{len(bboxes)} objects\")\n\n  for box in bboxes:\n      left, top, right, bottom = map(int, [box.left, box.top, box.right, box.bottom])\n      cv2.rectangle(image, (left, top), (right, bottom), tp.random_color(box.class_label), 5)\n\n  saveto = \"yolov5.car.jpg\"\n  print(f\"Save to {saveto}\")\n\n  cv2.imwrite(saveto, image)\n  cv2.imshow(\"result\", image)\n  cv2.waitKey()\n  ```\n\n## INTRO\n\n1. High level interface for C++/Python.\n2. Simplify the implementation of custom plugin. And serialization and deserialization have been encapsulated for easier usage.\n3. Simplify the compile of fp32, fp16 and int8 for facilitating the deployment with C++/Python in server or embeded device.\n4. Models ready for use also with examples are RetinaFace, Scrfd, YoloV5, YoloX, Arcface, AlphaPose, CenterNet and DeepSORT(C++)\n\n## YoloX and YoloV5-series Model Test Report\n\n\u003cdetails\u003e\n\u003csummary\u003eapp_yolo.cpp speed testing\u003c/summary\u003e\n  \n1. Resolution (YoloV5P5, YoloX) = (640x640),  (YoloV5P6) = (1280x1280)\n2. max batch size = 16\n3. preprocessing + inference + postprocessing\n4. cuda10.2, cudnn8.2.2.26, TensorRT-8.0.1.6\n5. RTX2080Ti\n6. num of testing: take the average on the results of 100 times but excluding the first time for warmup \n7. Testing log: [workspace/perf.result.std.log (workspace/perf.result.std.log)\n8. code for testing: [src/application/app_yolo.cpp](src/application/app_yolo.cpp)\n9. images for testing: 6 images in workspace/inference \n    - with resolution 810x1080，500x806，1024x684，550x676，1280x720，800x533 respetively\n10. Testing method: load 6 images. Then do the inference on the 6 images, which will be repeated for 100 times. Note that each image should be preprocessed and postprocessed.\n\n---\n\n| Model    | Resolution | Type      | Precision | Elapsed Time | FPS    |\n| -------- | ---------- | --------- | --------- | ------------ | ------ |\n| yolox_x  | 640x640    | YoloX     | FP32      | 21.879       | 45.71  |\n| yolox_l  | 640x640    | YoloX     | FP32      | 12.308       | 81.25  |\n| yolox_m  | 640x640    | YoloX     | FP32      | 6.862        | 145.72 |\n| yolox_s  | 640x640    | YoloX     | FP32      | 3.088        | 323.81 |\n| yolox_x  | 640x640    | YoloX     | FP16      | 6.763        | 147.86 |\n| yolox_l  | 640x640    | YoloX     | FP16      | 3.933        | 254.25 |\n| yolox_m  | 640x640    | YoloX     | FP16      | 2.515        | 397.55 |\n| yolox_s  | 640x640    | YoloX     | FP16      | 1.362        | 734.48 |\n| yolox_x  | 640x640    | YoloX     | INT8      | 4.070        | 245.68 |\n| yolox_l  | 640x640    | YoloX     | INT8      | 2.444        | 409.21 |\n| yolox_m  | 640x640    | YoloX     | INT8      | 1.730        | 577.98 |\n| yolox_s  | 640x640    | YoloX     | INT8      | 1.060        | 943.15 |\n| yolov5x6 | 1280x1280  | YoloV5_P6 | FP32      | 68.022       | 14.70  |\n| yolov5l6 | 1280x1280  | YoloV5_P6 | FP32      | 37.931       | 26.36  |\n| yolov5m6 | 1280x1280  | YoloV5_P6 | FP32      | 20.127       | 49.69  |\n| yolov5s6 | 1280x1280  | YoloV5_P6 | FP32      | 8.715        | 114.75 |\n| yolov5x  | 640x640    | YoloV5_P5 | FP32      | 18.480       | 54.11  |\n| yolov5l  | 640x640    | YoloV5_P5 | FP32      | 10.110       | 98.91  |\n| yolov5m  | 640x640    | YoloV5_P5 | FP32      | 5.639        | 177.33 |\n| yolov5s  | 640x640    | YoloV5_P5 | FP32      | 2.578        | 387.92 |\n| yolov5x6 | 1280x1280  | YoloV5_P6 | FP16      | 20.877       | 47.90  |\n| yolov5l6 | 1280x1280  | YoloV5_P6 | FP16      | 10.960       | 91.24  |\n| yolov5m6 | 1280x1280  | YoloV5_P6 | FP16      | 7.236        | 138.20 |\n| yolov5s6 | 1280x1280  | YoloV5_P6 | FP16      | 3.851        | 259.68 |\n| yolov5x  | 640x640    | YoloV5_P5 | FP16      | 5.933        | 168.55 |\n| yolov5l  | 640x640    | YoloV5_P5 | FP16      | 3.450        | 289.86 |\n| yolov5m  | 640x640    | YoloV5_P5 | FP16      | 2.184        | 457.90 |\n| yolov5s  | 640x640    | YoloV5_P5 | FP16      | 1.307        | 765.10 |\n| yolov5x6 | 1280x1280  | YoloV5_P6 | INT8      | 12.207       | 81.92  |\n| yolov5l6 | 1280x1280  | YoloV5_P6 | INT8      | 7.221        | 138.49 |\n| yolov5m6 | 1280x1280  | YoloV5_P6 | INT8      | 5.248        | 190.55 |\n| yolov5s6 | 1280x1280  | YoloV5_P6 | INT8      | 3.149        | 317.54 |\n| yolov5x  | 640x640    | YoloV5_P5 | INT8      | 3.704        | 269.97 |\n| yolov5l  | 640x640    | YoloV5_P5 | INT8      | 2.255        | 443.53 |\n| yolov5m  | 640x640    | YoloV5_P5 | INT8      | 1.674        | 597.40 |\n| yolov5s  | 640x640    | YoloV5_P5 | INT8      | 1.143        | 874.91 |\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eapp_yolo_fast.cpp speed testing. Never stop desiring for being faster\u003c/summary\u003e\n  \n- \u003cb\u003eHighlight:\u003c/b\u003e   0.5 ms faster without any loss in precision compared with the above. Specifically, we remove the Focus and some transpose nodes etc, and implement them in CUDA kenerl function. But the rest remains the same.\n- \u003cb\u003eTest log:\u003c/b\u003e   [workspace/perf.result.std.log](workspace/perf.result.std.log)\n- \u003cb\u003eCode for testing:\u003c/b\u003e   [src/application/app_yolo_fast.cpp](src/application/app_yolo_fast.cpp)\n- \u003cb\u003eTips:\u003c/b\u003e   you can do the modification while refering to the downloaded onnx. Any questions are welcomed through any kinds of contact.\n- \u003cb\u003eConclusion:\u003c/b\u003e   the main idea of this work is to optimize the pre-and-post processing. If you go for yolox, yolov5 small version, the optimization might help you.\n\n|Model|Resolution|Type|Precision|Elapsed Time|FPS|\n|---|---|---|---|---|---|\n|yolox_x_fast|640x640|YoloX|FP32|21.598 |46.30 |\n|yolox_l_fast|640x640|YoloX|FP32|12.199 |81.97 |\n|yolox_m_fast|640x640|YoloX|FP32|6.819 |146.65 |\n|yolox_s_fast|640x640|YoloX|FP32|2.979 |335.73 |\n|yolox_x_fast|640x640|YoloX|FP16|6.764 |147.84 |\n|yolox_l_fast|640x640|YoloX|FP16|3.866 |258.64 |\n|yolox_m_fast|640x640|YoloX|FP16|2.386 |419.16 |\n|yolox_s_fast|640x640|YoloX|FP16|1.259 |794.36 |\n|yolox_x_fast|640x640|YoloX|INT8|3.918 |255.26 |\n|yolox_l_fast|640x640|YoloX|INT8|2.292 |436.38 |\n|yolox_m_fast|640x640|YoloX|INT8|1.589 |629.49 |\n|yolox_s_fast|640x640|YoloX|INT8|0.954 |1048.47 |\n|yolov5x6_fast|1280x1280|YoloV5_P6|FP32|67.075 |14.91 |\n|yolov5l6_fast|1280x1280|YoloV5_P6|FP32|37.491 |26.67 |\n|yolov5m6_fast|1280x1280|YoloV5_P6|FP32|19.422 |51.49 |\n|yolov5s6_fast|1280x1280|YoloV5_P6|FP32|7.900 |126.57 |\n|yolov5x_fast|640x640|YoloV5_P5|FP32|18.554 |53.90 |\n|yolov5l_fast|640x640|YoloV5_P5|FP32|10.060 |99.41 |\n|yolov5m_fast|640x640|YoloV5_P5|FP32|5.500 |181.82 |\n|yolov5s_fast|640x640|YoloV5_P5|FP32|2.342 |427.07 |\n|yolov5x6_fast|1280x1280|YoloV5_P6|FP16|20.538 |48.69 |\n|yolov5l6_fast|1280x1280|YoloV5_P6|FP16|10.404 |96.12 |\n|yolov5m6_fast|1280x1280|YoloV5_P6|FP16|6.577 |152.06 |\n|yolov5s6_fast|1280x1280|YoloV5_P6|FP16|3.087 |323.99 |\n|yolov5x_fast|640x640|YoloV5_P5|FP16|5.919 |168.95 |\n|yolov5l_fast|640x640|YoloV5_P5|FP16|3.348 |298.69 |\n|yolov5m_fast|640x640|YoloV5_P5|FP16|2.015 |496.34 |\n|yolov5s_fast|640x640|YoloV5_P5|FP16|1.087 |919.63 |\n|yolov5x6_fast|1280x1280|YoloV5_P6|INT8|11.236 |89.00 |\n|yolov5l6_fast|1280x1280|YoloV5_P6|INT8|6.235 |160.38 |\n|yolov5m6_fast|1280x1280|YoloV5_P6|INT8|4.311 |231.97 |\n|yolov5s6_fast|1280x1280|YoloV5_P6|INT8|2.139 |467.45 |\n|yolov5x_fast|640x640|YoloV5_P5|INT8|3.456 |289.37 |\n|yolov5l_fast|640x640|YoloV5_P5|INT8|2.019 |495.41 |\n|yolov5m_fast|640x640|YoloV5_P5|INT8|1.425 |701.71 |\n|yolov5s_fast|640x640|YoloV5_P5|INT8|0.844 |1185.47 |\n  \n\u003c/details\u003e\n\n## Setup and Configuration\n\u003cdetails\u003e\n\u003csummary\u003eLinux\u003c/summary\u003e\n  \n  \n1. VSCode (highly recommended!)\n2. Configure your path for cudnn, cuda, tensorRT8.0 and protobuf.\n3. Configure the compute capability matched with your nvidia graphics card in Makefile/CMakeLists.txt\n    - e.g.  `-gencode=arch=compute_75,code=sm_75`. If you are using 3080Ti, that should be `gencode=arch=compute_86,code=sm_86`\n    - reference for the table for GPU Compute Capability:\n  https://developer.nvidia.com/cuda-gpus#compute\n4. Configure your library path in .vscode/c_cpp_properties.json\n5. CUDA version: CUDA10.2\n6. CUDNN version: cudnn8.2.2.26. Note that dev(.h file) and runtime(.so file) should be downloaded.\n7. tensorRT version：tensorRT-8.0.1.6-cuda10.2\n8. protobuf version（for onnx parser）：protobufv3.11.4\n    - if other version, refer to the ........\n    - link for download: https://github.com/protocolbuffers/protobuf/tree/v3.11.4\n    - download, compile and replace the path in Makefile/CMakeLists.txt with new path to protobuf3.11.4\n  - CMake:\n    - `mkdir build \u0026\u0026 cd build`\n    - `cmake ..`\n    - `make yolo -j8`\n  - Makefile:\n    - `make yolo -j8`\n  \n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eLinux: Compile for Python\u003c/summary\u003e\n\n- compile and install\n    - Makefile：\n        - set `use_python := true` in Makefile\n    - CMakeLists.txt:\n        - `set(HAS_PYTHON ON)` in CMakeLists.txt\n    - Type in `make pyinstall -j8`\n    - Complied files are in `python/pytrt/libpytrtc.so`\n\n\u003c/details\u003e\n  \n\u003cdetails\u003e\n\u003csummary\u003eWindows\u003c/summary\u003e\n\n  \n1. Please check the [lean/README.md](lean/README.md) for the detailed dependency\n2. In TensorRT.vcxproj, replace the `\u003cImport Project=\"$(VCTargetsPath)\\BuildCustomizations\\CUDA 10.0.props\" /\u003e` with your own CUDA path\n3. In TensorRT.vcxproj, replace the `\u003cImport Project=\"$(VCTargetsPath)\\BuildCustomizations\\CUDA 10.0.targets\" /\u003e` with your own CUDA path\n4. In TensorRT.vcxproj, replace the `\u003cCodeGeneration\u003ecompute_61,sm_61\u003c/CodeGeneration\u003e` with your compute capability.\n    - refer to the table in https://developer.nvidia.com/cuda-gpus#compute\n  \n5. Configure your dependency or download it to the foler /lean. Configure VC++ dir (include dir and refence)\n\n6. Configure your env, debug-\u003eenvironment\n7. Compile and run the example, where 3 options are available.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eWindows: Compile for Python\u003c/summary\u003e\n\n  \n1. Compile pytrtc.pyd. Choose python in visual studio to compile\n2. Copy dll and execute 'python/copy_dll_to_pytrt.bat'\n3. Execute the example in python dir by 'python test_yolov5.py'\n  - if installation is needed, switch to target env(e.g. your conda env) then 'python setup.py install', which has to be followed by step 1 and step 2.\n  - the compiled files are in `python/pytrt/libpytrtc.pyd`\n\n\u003c/details\u003e\n  \n  \n\u003cdetails\u003e\n\u003csummary\u003eOther Protobuf Version\u003c/summary\u003e\n  \n- in onnx/make_pb.sh, replace the path `protoc=/data/sxai/lean/protobuf3.11.4/bin/protoc` in protoc with the protoc of your own version\n\n```bash\n#cd the path in terminal to /onnx\ncd onnx\n\n#execuete the command to make pb files\nbash make_pb.sh\n```\n  \n- CMake:\n    - replace the `set(PROTOBUF_DIR \"/data/sxai/lean/protobuf3.11.4\")` in CMakeLists.txt with the same path of your protoc.\n\n```bash\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake yolo -j64\n```\n- Makefile:\n    - replace the path `lean_protobuf  := /data/sxai/lean/protobuf3.11.4` in Makefile with the same path of protoc\n\n```bash\nmake yolo -j64\n```\n\n\u003c/details\u003e\n  \n\n\u003cdetails\u003e\n\u003csummary\u003eTensorRT 7.x support\u003c/summary\u003e\n\n- The default is tensorRT8.x\n1. Replace onnx_parser_for_7.x/onnx_parser to src/tensorRT/onnx_parser\n    - `bash onnx_parser/use_tensorrt_7.x.sh`\n2. Configure Makefile/CMakeLists.txt path to TensorRT7.x\n3. Execute `make yolo -j64`\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eTensorRT 8.x support\u003c/summary\u003e\n\n- The default is tensorRT8.x\n1. Replace onnx_parser_for_8.x/onnx_parser to src/tensorRT/onnx_parser\n    - `bash onnx_parser/use_tensorrt_8.x.sh`\n2. Configure Makefile/CMakeLists.txt path to TensorRT8.x\n3. Execute `make yolo -j64`\n\n\u003c/details\u003e\n  \n  \n## Guide for Different Tasks/Model Support\n\u003cdetails\u003e\n\u003csummary\u003eYoloV5 Support\u003c/summary\u003e\n  \n- if pytorch \u003e= 1.7, and the model is 5.0+, the model is suppored by the framework \n- if pytorch \u003c 1.7 or yolov5(2.0, 3.0 or 4.0), minor modification should be done in opset.\n- if you want to achieve the inference with lower pytorch, dynamic batchsize and other advanced setting, please check our [blog](http://zifuture.com:8090) (now in Chinese) and scan the QRcode via Wechat to join us.\n\n\n1. Download yolov5\n\n```bash\ngit clone git@github.com:ultralytics/yolov5.git\n```\n\n2. Modify the code for dynamic batchsize\n```python\n# line 55 forward function in yolov5/models/yolo.py \n# bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)\n# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()\n# modified into:\n\nbs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)\nbs = -1\nny = int(ny)\nnx = int(nx)\nx[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()\n\n# line 70 in yolov5/models/yolo.py\n#  z.append(y.view(bs, -1, self.no))\n# modified into：\nz.append(y.view(bs, self.na * ny * nx, self.no))\n\n############# for yolov5-6.0 #####################\n# line 65 in yolov5/models/yolo.py\n# if self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:\n#    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)\n# modified into:\nif self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:\n    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)\n\n# disconnect for pytorch trace\nanchor_grid = (self.anchors[i].clone() * self.stride[i]).view(1, -1, 1, 1, 2)\n\n# line 70 in yolov5/models/yolo.py\n# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n# modified into:\ny[..., 2:4] = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh\n\n# line 73 in yolov5/models/yolo.py\n# wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n# modified into:\nwh = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh\n############# for yolov5-6.0 #####################\n\n\n# line 52 in yolov5/export.py\n# torch.onnx.export(dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # shape(1,3,640,640)\n#                                'output': {0: 'batch', 1: 'anchors'}  # shape(1,25200,85)  修改为\n# modified into:\ntorch.onnx.export(dynamic_axes={'images': {0: 'batch'},  # shape(1,3,640,640)\n                                'output': {0: 'batch'}  # shape(1,25200,85) \n```\n3. Export to onnx model\n```bash\ncd yolov5\npython export.py --weights=yolov5s.pt --dynamic --include=onnx --opset=11\n```\n4. Copy the model and execute it\n```bash\ncp yolov5/yolov5s.onnx tensorRT_cpp/workspace/\ncd tensorRT_cpp\nmake yolo -j32\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eYoloV7 Support\u003c/summary\u003e\n1. Download yolov7 and pth\n\n```bash\n# from cdn\n# or wget https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt\n\nwget https://cdn.githubjs.cf/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt\ngit clone git@github.com:WongKinYiu/yolov7.git\n```\n\n2. Modify the code for dynamic batchsize\n```python\n# line 45 forward function in yolov7/models/yolo.py \n# bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)\n# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()\n# modified into:\n\nbs, _, ny, nx = map(int, x[i].shape)  # x(bs,255,20,20) to x(bs,3,20,20,85)\nbs = -1\nx[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()\n\n# line 52 in yolov7/models/yolo.py\n# y = x[i].sigmoid()\n# y[..., 0:2] = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy\n# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n# z.append(y.view(bs, -1, self.no))\n# modified into：\ny = x[i].sigmoid()\nxy = (y[..., 0:2] * 2. - 0.5 + self.grid[i]) * self.stride[i]  # xy\nwh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i].view(1, -1, 1, 1, 2)  # wh\nclassif = y[..., 4:]\ny = torch.cat([xy, wh, classif], dim=-1)\nz.append(y.view(bs, self.na * ny * nx, self.no))\n\n# line 57 in yolov7/models/yolo.py\n# return x if self.training else (torch.cat(z, 1), x)\n# modified into:\nreturn x if self.training else torch.cat(z, 1)\n\n\n# line 52 in yolov7/models/export.py\n# output_names=['classes', 'boxes'] if y is None else ['output'],\n# dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # size(1,3,640,640)\n#               'output': {0: 'batch', 2: 'y', 3: 'x'}} if opt.dynamic else None)\n# modified into:\noutput_names=['classes', 'boxes'] if y is None else ['output'],\ndynamic_axes={'images': {0: 'batch'},  # size(1,3,640,640)\n              'output': {0: 'batch'}} if opt.dynamic else None)\n\n```\n3. Export to onnx model\n```bash\ncd yolov7\npython models/export.py --dynamic --grid --weight=yolov7.pt\n```\n4. Copy the model and execute it\n```bash\ncp yolov7/yolov7.onnx tensorRT_cpp/workspace/\ncd tensorRT_cpp\nmake yolo -j32\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eYoloX Support\u003c/summary\u003e\n  \n- download from: https://github.com/Megvii-BaseDetection/YOLOX\n- If you don't want to export onnx by yourself, just make run in the repo of Megavii\n\n1. Download YoloX\n```bash\ngit clone git@github.com:Megvii-BaseDetection/YOLOX.git\ncd YOLOX\n```\n\n2. Modify the code\nThe modification ensures a successful int8 compilation and inference, otherwise `Missing scale and zero-point for tensor (Unnamed Layer* 686)` will be raised.\n  \n```Python\n# line 206 forward fuction in yolox/models/yolo_head.py. Replace the commented code with the uncommented code\n# self.hw = [x.shape[-2:] for x in outputs] \nself.hw = [list(map(int, x.shape[-2:])) for x in outputs]\n\n\n# line 208 forward function in yolox/models/yolo_head.py. Replace the commented code with the uncommented code\n# [batch, n_anchors_all, 85]\n# outputs = torch.cat(\n#     [x.flatten(start_dim=2) for x in outputs], dim=2\n# ).permute(0, 2, 1)\nproc_view = lambda x: x.view(-1, int(x.size(1)), int(x.size(2) * x.size(3)))\noutputs = torch.cat(\n    [proc_view(x) for x in outputs], dim=2\n).permute(0, 2, 1)\n\n\n# line 253 decode_output function in yolox/models/yolo_head.py Replace the commented code with the uncommented code\n#outputs[..., :2] = (outputs[..., :2] + grids) * strides\n#outputs[..., 2:4] = torch.exp(outputs[..., 2:4]) * strides\n#return outputs\nxy = (outputs[..., :2] + grids) * strides\nwh = torch.exp(outputs[..., 2:4]) * strides\nreturn torch.cat((xy, wh, outputs[..., 4:]), dim=-1)\n\n# line 77 in tools/export_onnx.py\nmodel.head.decode_in_inference = True\n```\n\n \n3. Export to onnx\n```bash\n\n# download model\nwget https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_m.pth\n\n# export\nexport PYTHONPATH=$PYTHONPATH:.\npython tools/export_onnx.py -c yolox_m.pth -f exps/default/yolox_m.py --output-name=yolox_m.onnx --dynamic --no-onnxsim\n```\n\n4. Execute the command\n```bash\ncp YOLOX/yolox_m.onnx tensorRT_cpp/workspace/\ncd tensorRT_cpp\nmake yolo -j32\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eYoloV3 Support\u003c/summary\u003e\n  \n- if pytorch \u003e= 1.7, and the model is 5.0+, the model is suppored by the framework \n- if pytorch \u003c 1.7 or yolov3, minor modification should be done in opset.\n- if you want to achieve the inference with lower pytorch, dynamic batchsize and other advanced setting, please check our [blog](http://zifuture.com:8090) (now in Chinese) and scan the QRcode via Wechat to join us.\n\n\n1. Download yolov3\n\n```bash\ngit clone git@github.com:ultralytics/yolov3.git\n```\n\n2. Modify the code for dynamic batchsize\n```python\n# line 55 forward function in yolov3/models/yolo.py \n# bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)\n# x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()\n# modified into:\n\nbs, _, ny, nx = map(int, x[i].shape)  # x(bs,255,20,20) to x(bs,3,20,20,85)\nbs = -1\nx[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()\n\n\n# line 70 in yolov3/models/yolo.py\n#  z.append(y.view(bs, -1, self.no))\n# modified into：\nz.append(y.view(bs, self.na * ny * nx, self.no))\n\n# line 62 in yolov3/models/yolo.py\n# if self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:\n#    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)\n# modified into:\nif self.grid[i].shape[2:4] != x[i].shape[2:4] or self.onnx_dynamic:\n    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)\nanchor_grid = (self.anchors[i].clone() * self.stride[i]).view(1, -1, 1, 1, 2)\n\n# line 70 in yolov3/models/yolo.py\n# y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n# modified into:\ny[..., 2:4] = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh\n\n# line 73 in yolov3/models/yolo.py\n# wh = (y[..., 2:4] * 2) ** 2 * self.anchor_grid[i]  # wh\n# modified into:\nwh = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh\n\n\n# line 52 in yolov3/export.py\n# torch.onnx.export(dynamic_axes={'images': {0: 'batch', 2: 'height', 3: 'width'},  # shape(1,3,640,640)\n#                                'output': {0: 'batch', 1: 'anchors'}  # shape(1,25200,85) \n# modified into:\ntorch.onnx.export(dynamic_axes={'images': {0: 'batch'},  # shape(1,3,640,640)\n                                'output': {0: 'batch'}  # shape(1,25200,85) \n```\n3. Export to onnx model\n```bash\ncd yolov3\npython export.py --weights=yolov3.pt --dynamic --include=onnx --opset=11\n```\n4. Copy the model and execute it\n```bash\ncp yolov3/yolov3.onnx tensorRT_cpp/workspace/\ncd tensorRT_cpp\n\n# change src/application/app_yolo.cpp: main\n# test(Yolo::Type::V3, TRT::Mode::FP32, \"yolov3\");\n\nmake yolo -j32\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eUNet Support\u003c/summary\u003e\n  \n- reference to : https://github.com/shouxieai/unet-pytorch\n\n```\nmake dunet -j32\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eRetinaface Support\u003c/summary\u003e\n\n- https://github.com/biubug6/Pytorch_Retinaface\n\n1. Download Pytorch_Retinaface Repo\n\n```bash\ngit clone git@github.com:biubug6/Pytorch_Retinaface.git\ncd Pytorch_Retinaface\n```\n\n2. Download model from the Training of README.md in https://github.com/biubug6/Pytorch_Retinaface#training .Then unzip it to the /weights . Here, we use mobilenet0.25_Final.pth\n\n3. Modify the code\n\n```python\n# line 24 in models/retinaface.py\n# return out.view(out.shape[0], -1, 2) is modified into \nreturn out.view(-1, int(out.size(1) * out.size(2) * 2), 2)\n\n# line 35 in models/retinaface.py\n# return out.view(out.shape[0], -1, 4) is modified into\nreturn out.view(-1, int(out.size(1) * out.size(2) * 2), 4)\n\n# line 46 in models/retinaface.py\n# return out.view(out.shape[0], -1, 10) is modified into\nreturn out.view(-1, int(out.size(1) * out.size(2) * 2), 10)\n\n# The following modification ensures the output of resize node is based on scale rather than shape such that dynamic batch can be achieved.\n# line 89 in models/net.py\n# up3 = F.interpolate(output3, size=[output2.size(2), output2.size(3)], mode=\"nearest\") is modified into\nup3 = F.interpolate(output3, scale_factor=2, mode=\"nearest\")\n\n# line 93 in models/net.py\n# up2 = F.interpolate(output2, size=[output1.size(2), output1.size(3)], mode=\"nearest\") is modified into\nup2 = F.interpolate(output2, scale_factor=2, mode=\"nearest\")\n\n# The following code removes softmax (bug sometimes happens). At the same time, concatenate the output to simplify the decoding.\n# line 123 in models/retinaface.py\n# if self.phase == 'train':\n#     output = (bbox_regressions, classifications, ldm_regressions)\n# else:\n#     output = (bbox_regressions, F.softmax(classifications, dim=-1), ldm_regressions)\n# return output\n# the above is modified into:\noutput = (bbox_regressions, classifications, ldm_regressions)\nreturn torch.cat(output, dim=-1)\n\n# set 'opset_version=11' to ensure a successful export\n# torch_out = torch.onnx._export(net, inputs, output_onnx, export_params=True, verbose=False,\n#     input_names=input_names, output_names=output_names)\n# is modified into:\ntorch_out = torch.onnx._export(net, inputs, output_onnx, export_params=True, verbose=False, opset_version=11,\n    input_names=input_names, output_names=output_names)\n\n\n\n\n```\n4. Export to onnx\n```bash\npython convert_to_onnx.py\n```\n\n5. Execute\n```bash\ncp FaceDetector.onnx ../tensorRT_cpp/workspace/mb_retinaface.onnx\ncd ../tensorRT_cpp\nmake retinaface -j64\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eDBFace Support\u003c/summary\u003e\n\n- https://github.com/dlunion/DBFace\n\n```bash\nmake dbface -j64\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eScrfd Support\u003c/summary\u003e\n\n- https://github.com/deepinsight/insightface/tree/master/detection/scrfd\n- The know-how about exporting to onnx is comming. Before it is released, come and join us to disucss. \n\n\u003c/details\u003e\n\n\n\n\u003cdetails\u003e\n\u003csummary\u003eArcface Support\u003c/summary\u003e\n\n- https://github.com/deepinsight/insightface/tree/master/recognition/arcface_torch\n```C++\nauto arcface = Arcface::create_infer(\"arcface_iresnet50.fp32.trtmodel\", 0);\nauto feature = arcface-\u003ecommit(make_tuple(face, landmarks)).get();\ncout \u003c\u003c feature \u003c\u003c endl;  // 1x512\n```\n- In the example of Face Recognition, `workspace/face/library` is the set of faces registered.\n- `workspace/face/recognize` is the set of face to be recognized.\n- the result is saved in `workspace/face/result`和`workspace/face/library_draw`\n\n\u003c/details\u003e\n  \n\u003cdetails\u003e\n\u003csummary\u003eCenterNet Support\u003c/summary\u003e\n  \ncheck the great details in tutorial/2.0\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eBert Support(Chinese Classification)\u003c/summary\u003e\n\n- https://github.com/649453932/Bert-Chinese-Text-Classification-Pytorch\n- `make bert -j6`  \n\n\u003c/details\u003e\n\n\n## the INTRO to Interface\n\n\u003cdetails\u003e\n\u003csummary\u003ePython Interface：Get onnx and trtmodel from pytorch model more easily\u003c/summary\u003e\n\n- Just one line of code to export onnx and trtmodel. And save them for usage in the future.\n```python\nimport pytrt\n\nmodel = models.resnet18(True).eval()\npytrt.from_torch(\n    model, \n    dummy_input, \n    max_batch_size=16, \n    onnx_save_file=\"test.onnx\", \n    engine_save_file=\"engine.trtmodel\"\n)\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ePython Interface：TensorRT Inference\u003c/summary\u003e\n\n- YoloX TensorRT Inference\n```python\nimport pytrt\n\nyolo   = tp.Yolo(engine_file, type=tp.YoloType.X)   # engine_file is the trtmodel file\nimage  = cv2.imread(\"inference/car.jpg\")\nbboxes = yolo.commit(image).get()\n```\n\n- Seamless Inference from Pytorch to TensorRT\n```python\nimport pytrt\n\nmodel     = models.resnet18(True).eval().to(device) # pt model\ntrt_model = tp.from_torch(model, input)\ntrt_out   = trt_model(input)\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eC++ Interface：YoloX Inference\u003c/summary\u003e\n\n```C++\n\n// create infer engine on gpu 0\nauto engine = Yolo::create_infer(\"yolox_m.fp32.trtmodel\"， Yolo::Type::X, 0);\n\n// load image\nauto image = cv::imread(\"1.jpg\");\n\n// do inference and get the result\nauto box = engine-\u003ecommit(image).get();\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eC++ Interface：Compile Model in FP32/FP16\u003c/summary\u003e\n\n```cpp\nTRT::compile(\n  TRT::Mode::FP32,   // compile model in fp32\n  3,                          // max batch size\n  \"plugin.onnx\",              // onnx file\n  \"plugin.fp32.trtmodel\",     // save path\n  {}                         //  redefine the shape of input when needed\n);\n```\n- For fp32 compilation, all you need is offering onnx file whose input shape is allowed to be redefined.\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eC++ Interface：Compile in int8\u003c/summary\u003e\n\n- The in8 inference performs slightly worse than fp32 in precision(about -5% drop down), but stunningly faster. In the framework, we offer int8 inference\n\n```cpp\n// define int8 calibration function to read data and handle it to tenor.\nauto int8process = [](int current, int count, vector\u003cstring\u003e\u0026 images, shared_ptr\u003cTRT::Tensor\u003e\u0026 tensor){\n    for(int i = 0; i \u003c images.size(); ++i){\n    // int8 compilation requires calibration. We read image data and set_norm_mat. Then the data will be transfered into the tensor.\n        auto image = cv::imread(images[i]);\n        cv::resize(image, image, cv::Size(640, 640));\n        float mean[] = {0, 0, 0};\n        float std[]  = {1, 1, 1};\n        tensor-\u003eset_norm_mat(i, image, mean, std);\n    }\n};\n\n\n// Specify TRT::Mode as INT8\nauto model_file = \"yolov5m.int8.trtmodel\";\nTRT::compile(\n  TRT::Mode::INT8,            // INT8\n  3,                          // max batch size\n  \"yolov5m.onnx\",             // onnx\n  model_file,                 // saved filename\n  {},                         // redefine the input shape\n  int8process,                // the recall function for calibration\n  \".\",                        // the dir where the image data is used for calibration\n  \"\"                          // the dir where the data generated from calibration is saved(a.k.a where to load the calibration data.)\n);\n```\n- We integrate into only one int8process function to save otherwise a lot of issues that might happen in tensorRT official implementation. \n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eC++ Interface：Inference\u003c/summary\u003e\n\n- We introduce class Tensor for easier inference and data transfer between host to device. So that as a user, the details wouldn't be annoying.\n\n- class Engine is another facilitator.\n\n```cpp\n// load model and get a shared_ptr. get nullptr if fail to load.\nauto engine = TRT::load_infer(\"yolov5m.fp32.trtmodel\");\n\n// print model info\nengine-\u003eprint();\n\n// load image\nauto image = imread(\"demo.jpg\");\n\n// get the model input and output node, which can be accessed by name or index\nauto input = engine-\u003einput(0);   // or auto input = engine-\u003einput(\"images\");\nauto output = engine-\u003eoutput(0); // or auto output = engine-\u003eoutput(\"output\");\n\n// put the image into input tensor by calling set_norm_mat()\nfloat mean[] = {0, 0, 0};\nfloat std[]  = {1, 1, 1};\ninput-\u003eset_norm_mat(i, image, mean, std);\n\n// do the inference. Here sync(true) or async(false) is optional\nengine-\u003eforward(); // engine-\u003eforward(true or false)\n\n// get the outut_ptr, which can used to access the output\nfloat* output_ptr = output-\u003ecpu\u003cfloat\u003e();\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eC++ Interface：Plugin\u003c/summary\u003e\n\n- You only need to define kernel function and inference process. The details of code(e.g the serialization, deserialization and injection of plugin etc) are under the hood.\n- Easy to implement a new plugin in FP32 and FP16. Refer to HSwish.cu for details.\n```cpp\ntemplate\u003c\u003e\n__global__ void HSwishKernel(float* input, float* output, int edge) {\n\n    KernelPositionBlock;\n    float x = input[position];\n    float a = x + 3;\n    a = a \u003c 0 ? 0 : (a \u003e= 6 ? 6 : a);\n    output[position] = x * a / 6;\n}\n\nint HSwish::enqueue(const std::vector\u003cGTensor\u003e\u0026 inputs, std::vector\u003cGTensor\u003e\u0026 outputs, const std::vector\u003cGTensor\u003e\u0026 weights, void* workspace, cudaStream_t stream) {\n\n    int count = inputs[0].count();\n    auto grid = CUDATools::grid_dims(count);\n    auto block = CUDATools::block_dims(count);\n    HSwishKernel \u003c\u003c\u003cgrid, block, 0, stream \u003e\u003e\u003e (inputs[0].ptr\u003cfloat\u003e(), outputs[0].ptr\u003cfloat\u003e(), count);\n    return 0;\n}\n\n\nRegisterPlugin(HSwish);\n```\n\n\u003c/details\u003e\n\n\n## About Us\n- Our blog：http://www.zifuture.com/                        (Now only in Chinese. English is comming)\n- Our video channel： https://space.bilibili.com/1413433465 (Now only in Chinese. English is comming)\n\n\n\n\n\n\n\n\n\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshouxieai%2Ftensorrt_pro","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshouxieai%2Ftensorrt_pro","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshouxieai%2Ftensorrt_pro/lists"}