{"id":13643584,"url":"https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization","last_synced_at":"2025-04-21T02:30:41.544Z","repository":{"id":47616934,"uuid":"515276522","full_name":"NVIDIA-AI-IOT/yolov5_gpu_optimization","owner":"NVIDIA-AI-IOT","description":"This repository provides YOLOV5 GPU optimization sample","archived":false,"fork":false,"pushed_at":"2023-01-06T02:32:41.000Z","size":57,"stargazers_count":100,"open_issues_count":4,"forks_count":27,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-11-09T15:43:01.171Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA-AI-IOT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-07-18T17:18:03.000Z","updated_at":"2024-10-30T08:51:33.000Z","dependencies_parsed_at":"2023-02-05T03:17:01.399Z","dependency_job_id":null,"html_url":"https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fyolov5_gpu_optimization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fyolov5_gpu_optimization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fyolov5_gpu_optimization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA-AI-IOT%2Fyolov5_gpu_optimization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA-AI-IOT","download_url":"https:
//codeload.github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249986028,"owners_count":21356310,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T01:01:49.632Z","updated_at":"2025-04-21T02:30:40.528Z","avatar_url":"https://github.com/NVIDIA-AI-IOT.png","language":"Python","funding_links":[],"categories":["Lighter and Deployment Frameworks"],"sub_categories":[],"readme":"# YOLOV5 inference solution in DeepStream and TensorRT\nThis repo provides sample codes to deploy YOLOV5 models in DeepStream or stand-alone TensorRT sample on Nvidia devices.\n\n* [DeepStream sample](#deepstream-sample)\n* [TensorRT sample](#tensorrt-sample)\n* [Appendix](#appendix)\n\n## DeepStream sample\nIn this section, we will walk through the steps to run YOLOV5 model using DeepStream with CPU NMS.\n### Export the ultralytics YOLOV5 model to ONNX with TRT decode plugin\nYou could start from nvcr.io/nvidia/pytorch:22.03-py3 container for export.\n```\ngit clone https://github.com/ultralytics/yolov5.git\n# clone yolov5_trt_infer repo and copy the patch into yolov5 folder\ngit clone https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization.git\ncp yolov5_gpu_optimization/0001-Enable-onnx-export-with-decode-plugin.patch yolov5_gpu_optimization/requirement_export.txt yolov5/\ncd yolov5\ngit checkout a80dd66efe0bc7fe3772f259260d5b7278aab42f\ngit am 0001-Enable-onnx-export-with-decode-plugin.patch\npip install -r requirement_export.txt\napt update \u0026\u0026 apt install -y 
libgl1-mesa-glx \npython export.py --weights yolov5s.pt --include onnx --simplify --dynamic\n```\n### Prepare the library for DeepStream inference.\nYou could start from nvcr.io/nvidia/deepstream:6.1.1-devel container for inference.\n\nThen go to the deepstream sample directory.\n```\ncd deepstream-sample\n```\nCompile the plugin and deepstream parser:\n\n* On x86:\n    ```\n    nvcc -Xcompiler -fPIC -shared -o yolov5_decode.so ./yoloForward_nc.cu ./yoloPlugins.cpp ./nvdsparsebbox_Yolo.cpp -isystem /usr/include/x86_64-linux-gnu/ -L /usr/lib/x86_64-linux-gnu/ -I /opt/nvidia/deepstream/deepstream/sources/includes -lnvinfer \n    ```\n* On Jetson device:\n    ```\n    nvcc -Xcompiler -fPIC -shared -o yolov5_decode.so ./yoloForward_nc.cu ./yoloPlugins.cpp ./nvdsparsebbox_Yolo.cpp -isystem /usr/include/aarch64-linux-gnu/ -L /usr/lib/aarch64-linux-gnu/ -I /opt/nvidia/deepstream/deepstream/sources/includes -lnvinfer \n    ```\n### Run inference\nPlace the exported ONNX model in `deepstream-sample`:\n```\ncp yolov5/yolov5s.onnx yolov5_gpu_optimization/deepstream-sample/\n```\nThen run the model with the predefined configs.\n\n* Run inference and save the output video:\n    ```\n    deepstream-app -c config/deepstream_app_config_save_video.txt \n    ```\n* Run inference without display\n    ```\n    deepstream-app -c config/deepstream_app_config.txt \n    ```\n* Run inference with 8 streams and batch_size=8 and without display\n    ```\n    deepstream-app -c config/deepstream_app_config_8s.txt \n    ```\n\n### Performance summary:\nThe performance test is conducted on T4 with nvcr.io/nvidia/deepstream:6.1.1-devel\n\n| Model   | Input Size | Device | precision | 1 stream bs=1 | 4 streams bs=4 | 8 streams bs=8 |\n|---------|------------|--------|-----------|---------------|----------------|----------------|\n| yolov5n | 3x640x640  | T4     | FP16      | 640           | 980            | 988            |\n| yolov5m | 3x640x640  | T4     | FP16      | 220           
| 270            | 277            |\n\n## TensorRT sample\nIn this section, we will walk through the steps to run YOLOV5 model using GPU NMS with a stand-alone inference script.\n### Export the ultralytics YOLOV5 model to ONNX with TRT BatchNMS plugin\nYou could start from nvcr.io/nvidia/pytorch:22.03-py3 container for export.\n```\ngit clone https://github.com/ultralytics/yolov5.git\n# clone yolov5_trt_infer repo and copy files into yolov5 folder\ngit clone https://github.com/NVIDIA-AI-IOT/yolov5_gpu_optimization.git\ncp -r yolov5_gpu_optimization/0001-Enable-onnx-export-with-batchNMS-plugin.patch yolov5_gpu_optimization/requirement_export.txt yolov5/\ncd yolov5\ngit checkout a80dd66efe0bc7fe3772f259260d5b7278aab42f\ngit am 0001-Enable-onnx-export-with-batchNMS-plugin.patch\npip install -r requirement_export.txt\napt update \u0026\u0026 apt install -y libgl1-mesa-glx \npython export.py --weights yolov5s.pt --include onnx --simplify --dynamic\n```\n\n### Run with TensorRT:\n\nFor the following section, you could start from nvcr.io/nvidia/tensorrt:22.05-py3 and prepare the environment with:\n```\ncd tensorrt-sample\npip install -r requirement_infer.txt\napt update \u0026\u0026 apt install -y libgl1-mesa-glx \n```\n\nBuild the plugin library by following the [previous steps](#prepare-the-library-for-deepstream-inference).\n#### Run inference\n```\npython yolov5_trt_inference.py --input_images_folder=\u003c/path/to/coco/images/val2017/\u003e --output_images_folder=./coco_output --onnx=\u003c/path/to/yolov5s.onnx\u003e\n```\n#### Run evaluation on COCO17 validation dataset\n\n##### Square inference evaluation:\nThe image is resized to 3xINPUT_SIZExINPUT_SIZE while keeping its aspect ratio.\n```\npython yolov5_trt_inference.py --input_images_folder=\u003c/path/to/coco/images/val2017/\u003e --output_images_folder=\u003cpath/to/coco_output_dir\u003e --onnx=\u003c/path/to/yolov5s.onnx\u003e --coco_anno=\u003c/path/to/coco/annotations/instances_val2017.json\u003e \n```\n\n##### Rectangular 
inference evaluation:\nThis is not real rectangular inference as in pytorch. It is equivalent to setting `pad=0, rect=False, imgsz=input_size + stride` in ultralytics YOLOV5.\n```\n# Default FP16 precision\npython yolov5_trt_inference.py --input_images_folder=\u003c/path/to/coco/images/val2017/\u003e --output_images_folder=\u003cpath/to/coco_output_dir\u003e --onnx=\u003c/path/to/yolov5s.onnx\u003e --coco_anno=\u003c/path/to/coco/annotations/instances_val2017.json\u003e --rect\n```\n\n\n#### Evaluation in INT8 mode\nTo run int8 inference or evaluation, you need to install TensorRT above 8.4. You could start from `nvcr.io/nvidia/tensorrt:22.07-py3`\n\nThe following command runs evaluation in int8 precision (the calibration cache is saved to the path specified by `--calib_cache`):\n```\n# INT8 precision\npython yolov5_trt_inference.py --input_images_folder=\u003c/path/to/coco/images/val2017/\u003e --output_images_folder=\u003cpath/to/coco_output_dir\u003e --onnx=\u003c/path/to/yolov5s.onnx\u003e --coco_anno=\u003c/path/to/coco/annotations/instances_val2017.json\u003e --rect --data_type=int8 --save_engine=./yolov5s_int8_maxbs16.engine  --calib_img_dir=\u003c/path/to/coco/images/val2017/\u003e --calib_cache=yolov5s_bs16_n10.cache --n_batches=10 --batch_size=16 \n```\n\n**Notes**: The calibration algorithm for YOLOV5 is `IInt8MinMaxCalibrator` instead of `IInt8EntropyCalibrator2`. So if you want to play with `trtexec` with the saved calibration cache, you have to change the first line of cache from `MinMaxCalibration` to `EntropyCalibration2`.\n\n### Misc for TensorRT sample\n\n#### Performance\u0026\u0026mAP summary\nHere is the performance and mAP summary. 
Tested on V100 16G with TensorRT 8.2.5 in rectangular inference mode.\n\n| Model    | Input Size | precision | FPS bs=32 | FPS bs= 1 | mAP@0.5 |\n| -------- | ---------- | --------- | --------- | --------- | ------- |\n| yolov5n  | 640        | FP16      | 1295      | 448       | 45.9%   |\n| yolov5s  | 640        | FP16      | 917       | 378       | 57.1%   |\n| yolov5m  | 640        | FP16      | 614       | 282       | 64%     |\n| yolov5l  | 640        | FP16      | 416       | 202       | 67.3%   |\n| yolov5x  | 640        | FP16      | 231       | 135       | 68.5%   |\n| yolov5n6 | 1280       | FP16      | 341       | 160       | 54.2%   |\n| yolov5s6 | 1280       | FP16      | 261       | 139       | 63.2%   |\n| yolov5m6 | 1280       | FP16      | 155       | 99        | 68.8%   |\n| yolov5l6 | 1280       | FP16      | 106       | 68        | 70.7%   |\n| yolov5x6 | 1280       | FP16      | 60        | 45        | 71.9%   |\n\n#### nbit-NMS\nUsers can also enable nbit-NMS by changing the `scoreBits` in export.py. \n```python\n# Default to be 16-bit\nnms_attrs[\"scoreBits\"] = 16\n# Can be changed to smaller one to boost NMS operation:\n# e.g. nms_attrs[\"scoreBits\"] = 8\n```\nperformance gain:\n| Classes number | Device    | Anchors number | Score bits | Batch size | NMS Execution time (ms) |\n| -------------- | --------- | -------------  | ---------- | ---------- | ----------------------- |\n| 80             | A30       | 25200          | 16         | 32         | 12.1                    |\n| 80             | A30       | 25200          | 8          | 32         | 10.0                    |\n| 4              | Xavier NX | 10560          | 16         | 4          | 1.38                    |\n| 4              | Xavier NX | 10560          | 8          | 4          | 1.08                    |\n\n*Note*: small score bits may slightly decrease the final mAP. 
\n\n#### DeepStream deployment:\nUsers can integrate the YOLOV5 with BatchedNMS plugin into DeepStream following [deepstream_tao_apps](https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps)\n\n## Appendix:\n### YOLOV5 with different activation:\nWe conducted experiments with different activations to pursue a better trade-off between mAP and performance on TensorRT.\n\nYou can change the activation of YOLOV5 model in `yolov5/models/common.py`:\n```\nclass Conv(nn.Module):\n    # Standard convolution\n    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups\n        super().__init__()\n        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)\n        self.bn = nn.BatchNorm2d(c2)\n        # self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())\n        self.act = nn.ReLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())\n\n    def forward(self, x):\n        return self.act(self.bn(self.conv(x)))\n\n    def forward_fuse(self, x):\n        return self.act(self.conv(x))\n```\n\nYOLOV5s experiment results so far:\n\n|     Activation type     |     mAP@0.5                                          |     V100 --best FPS (bs = 32)    |     A10  --best FPS (bs=32)    |\n|-------------------------|------------------------------------------------------|----------------------------------|--------------------------------|\n|     swish (baseline)    |     56.7%                                            |     1047                         |     965                        |\n|     ReLU                |     54.8% (scratch)\u003cbr\u003e55.7% (swish pretrained)      |     1177                         |     1065                       |\n|     GELU                |     56.6%                                            |     1004                         |     916                        |\n|     Leaky ReLU          |     55.0%                
                            |     1172                         |     892                        |\n|     PReLU               |     54.8%                                            |     1123                         |     932                        |\n\n## Known issue:\n\n- int8 0% mAP in TensorRT 8.2.5: Install TensorRT above 8.4 to avoid the issue.\n- TensorRT warning at the end of the execution of stand-alone tensorrt inference script: The warning won't block the inference or evaluation. You can just ignore it.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA-AI-IOT%2Fyolov5_gpu_optimization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNVIDIA-AI-IOT%2Fyolov5_gpu_optimization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA-AI-IOT%2Fyolov5_gpu_optimization/lists"}