{"id":20469075,"url":"https://github.com/pfnet-research/chainer-trt","last_synced_at":"2025-09-07T17:35:13.663Z","repository":{"id":86583107,"uuid":"159161390","full_name":"pfnet-research/chainer-trt","owner":"pfnet-research","description":"Chainer x TensorRT","archived":false,"fork":false,"pushed_at":"2019-03-20T11:16:45.000Z","size":1007,"stargazers_count":34,"open_issues_count":6,"forks_count":6,"subscribers_count":45,"default_branch":"master","last_synced_at":"2025-04-13T10:43:30.203Z","etag":null,"topics":["chainer","cpp","deep-learning","neural-network","python","tensorrt"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pfnet-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-11-26T11:51:17.000Z","updated_at":"2024-05-01T09:55:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"526d8f5f-4071-4d24-aa9a-84d1d06cd79b","html_url":"https://github.com/pfnet-research/chainer-trt","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pfnet-research/chainer-trt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pfnet-research%2Fchainer-trt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pfnet-research%2Fchainer-trt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pfnet-research%2Fchainer-trt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pfnet-research%2Fchai
ner-trt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pfnet-research","download_url":"https://codeload.github.com/pfnet-research/chainer-trt/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pfnet-research%2Fchainer-trt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274069786,"owners_count":25217196,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-07T02:00:09.463Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chainer","cpp","deep-learning","neural-network","python","tensorrt"],"created_at":"2024-11-15T14:07:50.314Z","updated_at":"2025-09-07T17:35:13.546Z","avatar_url":"https://github.com/pfnet-research.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# chainer-trt\n\nA toolkit for converting Chainer model to TensorRT inference engine, and run inference.\n\n## Concept and overview\n\n![concept diagram](images/concept_diagram.png)\n\n\n\n------\n## System requirements\n\n* Hardware requirements\n  * NVIDIA GPU supported by NVIDIA TensorRT\n    * Confirmed on Desktop PC and Jetson Xavier\n* System requirements\n  * NVIDIA TensorRT 5.0\n  * CUDA 9.0 or newer\n  * GCC5.4 or newer (Need C++14 support)\n  * Python 3.6 or newer\n  * Chainer v4 or newer\n  * Google glog 0.3.5 or newer\n* Optional 
requirements\n  * pybind11 2.1.4 or newer\n  * OpenCV 3.2 or newer (for ImageNet demo)\n    * With Python interface\n  * Google Test 1.8.0 (for automated tests)\n  * Google Benchmark 1.3.0 or newer (for automated micro benchmarks)\n\n\n------\n## Getting started with a minimum example\n\nPlease make sure that you have already satisfied the \"System requirements\" above.\n\n```bash\n# Install necessary tools\n% sudo apt install g++ cmake libgoogle-glog-dev libboost-all-dev libopencv-dev\n\n% git clone git@github.com:pfnet-research/chainer-trt.git\n% cd chainer-trt\n% mkdir build; cd build\n% cmake -DWITH_PYTHON_LIB=no -DWITH_TEST=no ..\n% make\n```\n\nYou will find libchainer_trt.so and libchainer_trt.a in the build directory.\nThey are the main library of chainer-trt.\n\nHere is an example for building TensorRT inference engine of an ImageNet image classification network using chainer-trt.\n\n```bash\n% cd /path/to/chainer-trt       # If you're in the build directory, go back to root\n\n# Get an example image\n% wget -nv \"https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Cat_November_2010-1a.jpg/359px-Cat_November_2010-1a.jpg\" -O cat.jpg\n\n# Dump a Chainer-based pretrained network to an intermediate representation\n% python example_imagenet/dump_chainer.py ResNet50Layers -r resnet50\n\n# Build a TensorRT engine for the 0-th GPU, this may take a while\n% ./build/example_imagenet/imagenet_tensorrt_builder/imagenet_tensorrt_builder -g 0 -i resnet50 -o resnet50/fp32.trt\n\n# Run the built engine on the 0-th GPU with an image for 1000 times\n% ./build/example_imagenet/imagenet_infer/imagenet_infer -m resnet50/fp32.trt -g 0 -i cat.jpg -n 1000\nUsing GPU=0 (Name=\"GeForce GTX 1080 Ti\",CC=6.1, VRAM=11162MB)\nBatch-size = 1\nLoading model\nLoading labels\nLoading image\nSend input to GPU\nAverage inference time = 2.596ms\nGet output from GPU\n0.457860 - tiger cat\n0.326168 - tabby, tabby cat\n0.204899 - Egyptian cat\n0.001696 - lynx, catamount\n0.000969 - 
plastic bag\n```\n\nFor comparison, you can run Chainer-based inference with the same CNN.\n\n```bash\n% python example_imagenet/imagenet_infer_reference.py -m ResNet50Layers -i cat.jpg -n 1000 -g 0\nLoading model\nLoading labels\nLoading image\nInference\nAverage inference time = 19.139971890000197ms\n0.468369 - tiger cat\n0.308983 - tabby, tabby cat\n0.202293 - Egyptian cat\n0.003273 - lynx, catamount\n0.001592 - plastic bag\n```\n\nPlease note that the results of the two models can differ slightly,\nbecause TensorRT may choose a non-deterministic convolution algorithm\nduring its optimization process.\n\n\n------\n## Components\n\nchainer-trt consists of the following main components.\n\n\n### ModelRetriever\n\nThe key Python-side component, which dumps a Chainer-based neural network to an\nintermediate representation.\n\n\n\n### chainer-trt main library\n\nThe key C++-side component, which essentially wraps the TensorRT C++ API.\nIt has interfaces for building a TensorRT inference engine, and interfaces for\nrunning inference.\n\n\n### Plugin libraries\n\nTensorRT has a mechanism to insert custom layer implementations for layers\nthat are not natively supported by TensorRT.\n\nchainer-trt provides various plugin implementations including\n[Shift operation](https://docs.chainer.org/en/stable/reference/generated/chainer.functions.shift.html),\n[Resize Images operation](https://docs.chainer.org/en/stable/reference/generated/chainer.functions.resize_images.html),\nand so on.\n\nThese plugins are used automatically when you use the corresponding Chainer function in your forward-pass code.\n\n\n#### Developing your own plugin\n\nIf you need your own plugin, there are two ways to provide it.\n\n* Implement it as a chainer-trt contribution\n* Implement it outside chainer-trt and register it with chainer-trt\n\nIf your plugin is generally useful and non-proprietary,\nplease consider implementing it as a part of the chainer-trt plugin library,\nso that 
anyone who uses chainer-trt can make use of it.\nTo learn how to implement plugins inside chainer-trt, look at\n`src/include/plugins/(plugin_name).hpp` and `src/plugins/(plugin_name).cpp`,\nand at `src/plugins/plugin.cpp`, which lets chainer-trt recognize the plugin.\n\nOtherwise, in order not to disclose the details of the plugin operator,\nyou can implement it *outside* chainer-trt and inject it into the build and load process of an engine.\nThis is called an *external plugin*.\nThe example `example_external_plugin` shows how to implement one.\n\nIn either case, you need to follow these steps.\n\n1. Implement a plugin class based on `chainer_trt::plugin::plugin_base\u003cT\u003e`\n  1-1. Implement CUDA kernels that perform the actual processing\n  1-2. Write sufficient tests to confirm the kernels work\n2. Implement a builder function (`build_layer` in examples)\n3. Register the builder function and the deserializer function with the plugin factory\n  3-1. In case of internal plugins, you can register them in the ctor of `plugin_factory`\n  3-2. 
In case of external plugins, you can register them after instantiating `plugin_factory`\n\n\n### Chainer(Cupy) compatible inference interface\n\nchainer-trt provides a thin wrapper around the C++ interface so that you can\ndirectly and easily run inference from Python with numpy and cupy arrays.\n\n\n------\n## Installation detail\n\nCurrently, chainer-trt has to be built manually,\nfor both the Python part and the C++ part.\n\n\n### Installing `ModelRetriever`\n\nThe Python module that includes `ModelRetriever` has to be installed with setup.py\nas follows.\n\n```bash\n% cd /path/to/chainer-trt\n% pip install -e .\n# OR\n% python setup.py install\n```\n\nThen, please make sure that it is correctly installed.\n```bash\n% python -c \"import chainer_trt\"\n```\n\n\n### Installing main library\n\nchainer-trt uses CMake to build.\n\n```bash\n% cd /path/to/chainer-trt\n% mkdir build; cd build\n% cmake -DWITH_TOOLS=YES ..\n% make\n% make install\n```\n\nYou can choose which components to build with the following options.\n\n\n#### Build tools (`-DWITH_TOOLS`, default=`YES`)\n\nThe tools consist of a tiny conversion program that converts a dumped Chainer\nmodel to a TensorRT engine file.\n\n\n#### Build ImageNet examples (`-DWITH_EXAMPLES`, default=`YES`)\n\nThe ImageNet examples that are used in the quick start section.\nThey are described in detail later.\nThis requires OpenCV.\n\n\n#### Build Python inference interface (`-DWITH_PYTHON_LIB`, default=`YES`)\n\nA thin wrapper interface to bridge the Python and C++ worlds.\nThis requires `Python.h` to be visible to the compiler.\nRun this command before cmake.\n```\n% export CPATH=$CPATH:$(python -c \"import distutils.sysconfig; print(distutils.sysconfig.get_python_inc())\")\n```\nA shared object `libpyrt.so` will be created. 
This has to be in a location\nvisible from `PYTHONPATH` (not `LD_LIBRARY_PATH`).\nYou can confirm if Python can find and load it by\n```python\n% python -c \"import chainer_trt; print(chainer_trt.is_python_interface_built)\"\nTrue\n```\n(If you got `False`, something is wrong).\n\n\n#### Build Automated tests (`-DWITH_TEST`, default=`YES`)\n\nTest of C++ part. This requires google glog and google test.\n\n\n#### Automated micro benchmarks (`-DBENCHMARK`, default=`NO`)\n\nIn order to help optimizing plugin implementations, chainer-trt provides\nseveral micro-benchmark codes.\nThis requires google glog and google benchmark.\n\n\n#### nvprof profiling improvement (`-DWITH_NVTX`, default=`NO`)\n\nAn NVTX extension is built if this option is enabled.\n\nnvprof (and nvvp) is very useful for performance analysis of CUDA kernels.\nBut they basically just show timeline of CUDA kernels,\nwhich is sometimes difficult to know the semantic correspondence between timelines and codes\n(especially with a black box, TensorRT).\n\nNVTX is a CUDA API that allows user to show an arbitrary bar in nvvp profiling result.\nchainer-trt's NVTX extension is to show additional timelines in NVTX\nwhen running inference.\n\n\n------\n## Detailed flow\n\nAs shown in the above diagram and quick-start, workflow of chainer-trt has\nthe following steps.\n\nDetailed explanation of each step is described in the later sections.\n\n\n### (1) Model dump process\n\nThe first process is to convert a Chainer model to an intermediate representation.\n\nLet's suppose you have a Chainer-based inference code,\nnext step is to let chainer-trt figure out structure of the computational graph\nand its layers' parameters (weights), which we call \"dump\" process.\n\nHere is a simple Python-based inference skeleton.\n\n```python\nimport chainer\n\nclass Net(chainer.Chain):\n    def __init__(self):\n        ...\n        \n    def forward(self, x):\n        h = f1(x)\n        h = f2(h)\n        return h\n\nx = ...     
# prepare input\nwith chainer.using_config('train', False):\n    y = net(x)\nprint(y)    # show prediction\n```\n\nIn order to dump the network to a file,\nyou just need to run a forward pass once with dummy data and\npass the output (whose type is `chainer.Variable` or `chainer.VariableNode`) to\nchainer-trt (a `ModelRetriever` object).\n\nchainer-trt then automatically retrieves all the information needed\nto describe the network, and saves it to a destination directory.\n\n```python\nimport chainer\nimport numpy as np\n\nimport chainer_trt\n\n# dummy input (float32, Chainer's default dtype)\nx = chainer.Variable(np.random.random((1, 3, 10, 10)).astype(np.float32))\n\nretriever = chainer_trt.ModelRetriever(\"dump_out\")\nretriever.register_inputs(x, name=\"input\")\nwith chainer.using_config('train', False):\n    with chainer_trt.RetainHook():\n        y = net(x)\n\nretriever(y, name=\"prob\")\nretriever.save()\n```\n\nThe output `dump_out` is a directory containing the following files.\n\n* `model.json`: Describes network structure including input and output\n* `*.weights`: Parameters of each layer. e.g. 
One Conv layer may have 2 weight files for conv parameter and bias values\n\n\n\n### (2) Build TensorRT engine\n\nAfter getting an intermediate representation of your NN,\nthe next step is to build an inference engine.\n\nThis step is in C++ world.\n\n```cpp\n#include \u003cchainer_trt/chainer_trt.hpp\u003e\n...\n\nauto m = chainer_trt::model::build_fp32(\"dump_out\",\n                                        4.0,   // workspace size in GB\n                                        1);    // max batch size\nm-\u003eserialize(\"fp32.trt\");\n```\n\nYou can simply call `chainer_trt::model::build_fp32` with directory name\nof the intermediate representation, and call `serialize` to save it to a file.\n\nBuild process does device-independent and device-specific optimization,\nexplained in [Deploying Deep Neural Networks with NVIDIA TensorRT](https://devblogs.nvidia.com/deploying-deep-learning-nvidia-tensorrt/).\n\n**FP16 mode**\n`build_fp16` is also available to build FP16 mode inference engine.\nWith natively supported hardware like V100, it brings significant speedup,\nbut otherwise it doesn't, moreover it just increases type conversion overhead.\n\n**INT8 mode**\nTo build INT8 mode, you need to call `build_int8` with *calibration datasets*.\nTensorRT provides calibration mechanism to intelligently identify the\nquantization criteria for each layer based on the actual data.\nSo you need to implement a task-specific stream class that feeds the actual\ndata one after another to build an inference engine with INT8 mode.\nThis is explained in detail in the later section.\n\nBe noted that the built engine file is *NOT* compatible with any other environment,\nsince it is optimized specifically for your GPU, system, and environment.\n\nSo if you need to build inference engine for several environments,\nyou need to run the build process on each of them.\nDump process is environment-independent, so you need to do only once and re-use.\n\nAlso, a simple default builder tool that 
only supports FP32 and FP16\nis already provided (you can find it in `tools/` directory).\n\nAfter building chainer-trt, running this tool builds the model.\n\n```bash\n% ./build/tools/tensorrt_builder/tensorrt_builder -i dump_out -o fp32.trt\n```\n\nThis tiny tool is sufficient as long as you use only FP32 and FP16 modes.\n\n\n\n### (3a) Run inference (from C++ code)\n\nThe next step is to run inference.\nThis section explains how to run inference from C++.\n\nFirst load a model and initialize an engine.\n\n```cpp\n#include \u003cchainer_trt/chainer_trt.hpp\u003e\n\n// Needed if you specify input/output names with string literals (the `s` suffix)\n#include \u003cstring\u003e\nusing namespace std::literals::string_literals;\n\n...\n\nauto m = chainer_trt::model::deserialize(\"fp32.trt\");\nchainer_trt::infer rt(m);\n```\n\nThen, load an input to a host-side buffer and call `infer_from_cpu` to get the inference result.\n\n```cpp\nfloat* x = new float[...];  // input\nfloat* y = new float[...];  // output\n\nfor(;;) {\n    // load input data to x\n    load_input(x, ...);\n    \n    rt.infer_from_cpu(1,        // batch size\n                      {{\"input\"s, x}},\n                      {{\"prob\"s, y}});\n                      \n    // here, y has output values\n}\n```\n\nA CUDA stream can also be used to overlap inference processes.\n\n```cpp\ncudaStream_t s;\ncudaStreamCreate(\u0026s);\nrt.infer_from_cpu(1, {{\"input\"s, x1}}, {{\"prob\"s, y1}}, s);\nrt.infer_from_cpu(1, {{\"input\"s, x2}}, {{\"prob\"s, y2}}, s);\n```\n\nBut be careful: multiple inference processes on a single `chainer_trt::infer`\ninstance **cannot** be run concurrently from multiple threads, as it is not thread safe.\n\nIn case you want to run multi-threaded inference,\nyou have to instantiate `chainer_trt::infer` for each worker.\nRefer to the ImageNet example below for more details.\n\n\n#### Efficient memory control (`chainer_trt::buffer`)\n\nEvery time `chainer_trt::infer::infer_from_cpu` is called,\nit 
allocates GPU memory for input and output and deallocate after inference,\nwhich is less efficient.\n\n`chainer_trt::buffer` provides a simple way to manage GPU buffers and\nkeep them alive as long as needed.\n\n\n```cpp\nchainer_trt::infer rt(m);\nauto buf = rt.create_buffer(1);     // allocates GPU memory\n\nfloat* x = new float[...];  // input on CPU side\nfloat* y = new float[...];  // output on CPU side\n\nfor(;;) {\n    // load input data to x\n    load_input(x, ...);\n    \n    buf-\u003einput_host_to_device({{\"input\", x}});\n    rt(*buf);   // run inference\n    buf-\u003eoutput_device_to_host({{\"prob\", y}});\n}\n```\n\nSince GPU memories are allocated before the inference loop,\nmemory allocation overhead won't happen.\nBy using `buffer`, you don't have to manually and separately allocate GPU memory and manage them.\n\n\n#### Manual buffer control\n\n`chainer_trt::buffer` assumes that you don't have any preprocessing and postprocessing\non GPU side before and after the inference process.\nSo this is sometimes less flexible in case you need to modify input and/or outputs on GPU.\n\nIn such case, manual memory control is needed.\n\n```cpp\nfloat* x = new float[...];\nfloat* x_gpu;\ncudaMalloc(\u0026x_gpu, sizeof(float) * ....);\nfloat* y_gpu;\ncudaMalloc(\u0026y_gpu, sizeof(float) * ....);\n\nfor(;;) {\n    // load input data to x\n    load_input(x);\n    \n    // send it to GPU\n    cudaMemcpy(x_gpu, x, sizeof(float) * ...., cudaMemcpyHostToDevice);\n    \n    // do some preprocessing on GPU\n    preprocessing(x_gpu);\n    \n    // run inference (call chainer_trt::infer::operator())\n    rt(1, {{\"input\", x_gpu}}, {{\"output\", y_gpu}});\n    \n    // do some postprocessing on GPU\n    postprocessing(y_gpu);\n    \n    cudaMemcpy(y, y_gpu, sizeof(float) * ...., cudaMemcpyDeviceToHost);\n}\n```\n\n\n### (3b) Run inference (from Python)\n\nchainer-trt also provides an interface to run inference from Python code.\n\nThis interface accepts both numpy array 
and cupy array,\nand if numpy arrays are specified as inputs, returned value will be a list of\nnumpy arrays, and vice versa.\n\n```python\nimport chainer_trt\n\n# load an inference engine\ninfer = chainer_trt.Infer(\"fp32.trt\")\n\nx = cupy.array()  # prepare a data. numpy.array is also OK\n\n# run inference\ny = infer({'input': x})['prob']\n#y = infer([x])[0]      # this is also OK\n\n# here, y has output values\n```\n\nFrom performance perspective, writing everything in C++ and\nkicking the inference from there would be the best,\nbut in some cases it is very useful if we can directly call\nhighly optimized inference from Python,\ne.g. when integrating TensorRT in a web system.\n\n\n\n------\n## ImageNet example details\n\nExamples in example_imagenet explains how to dump Chainer-based models, build them and efficiently running inference.\n\n| Code                              | Description                                                                                       |\n|:----------------------------------|:--------------------------------------------------------------------------------------------------|\n| dump_chainer.py                   | A simple tool to dump Chainer-predefined ImageNet                                                 |\n| dump_caffemodel.py                | *not well-tested* An example to dump caffemodel using Chainer's CaffeFunction                     |\n| imagenet_tensorrt_builder/        | Inference engine builder for ImageNet, with INT8 calibration support                              |\n| imagenet_infer/                   | An example of single-image inference, useful for inference latency benchmark                      |\n| imagenet_infer_fast/              | An example of high-throughput inference example, useful for throughput benchmark                  |\n| imagenet_infer_reference.py       | Chainer-based single-image inference example (equivalent to imagenet_infer), for checking result  |\n| 
imagenet_infer_reference_eval.py  | Chainer-based high-throughput inference example (equivalent to imagenet_infer_fast)               |\n| imagenet_infer_tensorrt.py        | An example of running TensorRT inference from Python                                              |\n\nIn the following sections, you are assumed to have built chainer-trt in `build` directory in chainer-trt root.\n\n\n### Example 1: Building an inference engine of chainer-predefined ImageNets\n\nThe very basic usage of ImageNet example is shown in quick start section.\nhere explains a bit more details.\n\n#### Dump model\n\nFirst of all you have to dump a Chainer-based model to an intermediate representation.\nThis process itself doesn't require TensorRT in your system, since it just traces computational graphs and saves to files.\n\n```bash\n% python dump_chainer.py ResNet50Layers -r resnet50\n```\n\nThe supported chainer-predefined ImageNet classified models are\n`ResNet50Layers`, `ResNet101Layers`, `ResNet152Layers`, `VGG16Layers` and `GoogLeNet`,\nwhich are implemented in `chainer.links`.\n\nBe noted that you may have to manually prepare pretrained weights file in advance (ResNets).\n\n##### Include preprocessing in computational graph\n\n`dump_chainer.py` not only just dumps the core CNN, it also includes mean subtraction in the dumped computational graph\nand transposing HWC to CHW order.\n\nUsually ImageNets assume input images to be zero-mean and its order is CHW format, with user's responsibility.\nBut in this example, these preprocessings are done as a part of computational graph,\nso you can directly feed raw images loaded from disk into CNN (you still need to convert data to float32, though).\n\nThe important point here is that you have to make input `chainer.Variable` **before**\napplying operations that you'd like to include in the dump.\nOtherwise these operations are not recorded in the computational graph.\n\n```python\nx = chainer.Variable(x)     # \u003c-----\n...\nmean = 
numpy.array([103.939, 116.779, 123.68]).astype(numpy.float32)\n...\nx = x.transpose((0, 3, 1, 2))   # hwc2chw\nx = x - mean\n```\n\n\n##### Tell what is the input\n\nAnother important point is that you also have to tell `ModelRetriever` which Variable is the input,\nbecause from the computational graph's point of view, both `x` and `mean` in the above example are\nterminal nodes, and there's no way to know which one is the input.\n\nSo you have to explicitly let it know.\nNote that `x` here needs to be a `chainer.Variable` rather than a numpy/cupy array.\n\n```python\nretriever.register_inputs(x)\n```\n\nIf you forget to do this, chainer-trt will treat both `x` and `mean` as inputs,\nso you will have to feed not only the input image but also the mean value array.\n\nIn `dump_chainer.py`, the `name` option is passed to `register_inputs`,\nso that you can specify input data by name during inference.\nThis is useful if your NN has multiple inputs.\nIf `name` is not specified, chainer-trt automatically decides a name for that input.\n\n```python\nretriever.register_inputs(x, name=\"input\")\n```\n\n\n##### Verbose mode\n\n`ModelRetriever` supports a verbose mode, which can be enabled with the `--verbose` option of `dump_chainer.py`.\n\nIn verbose mode, the following happens.\n\n* `ModelRetriever.__call__` prints the name of each layer it detects to stdout.\n* In the dump destination directory,\n  * All the input and output values of each layer are saved, and `model.json` includes their filenames, which is useful for debugging\n  * `model.json` is prettified\n  * A visualization of the computational graph made with `chainer.computational_graph` is saved\n\n\n#### Build an inference engine\n\nThe next step is to build an inference engine.\n\n```bash\n% ./build/example_imagenet/imagenet_tensorrt_builder/imagenet_tensorrt_builder -i resnet50 -o resnet50/fp32.trt\n```\n\nThis tiny tool basically just calls `chainer_trt::model::build_{fp32/fp16/int8}`,\nso regarding the options for workspace size (`--workspace`, `-w`) and 
max batch size (`--max-batch`, `-b`),\nplease refer to the sections above.\n\nTensorRT uses the currently active device (set by `cudaSetDevice`) and optimizes the inference engine for that particular device.\n`imagenet_tensorrt_builder` has a `--gpu` (`-g`) option, where you can specify on which GPU your inference engine will run.\n\nWith `--mode fp32` (default) and `--mode fp16`, you don't need anything else.\n`--mode int8` requires an additional option, which is explained in detail in \"Example 2: Building INT8 TensorRT engine\".\n\n\n#### Run the built inference engine\n\nThe simplest example of running inference is `imagenet_infer`.\n\n```\n% ./build/example_imagenet/imagenet_infer/imagenet_infer -m resnet50/fp32.trt -i cat.jpg -n 1000\n```\n\nThis tool measures average inference time for each batch,\n*without* data transfer between host and GPU.\nSo the result can be regarded as the minimum latency of your model on your device.\n\nIt also has a `--gpu` (`-g`) option. The same GPU must be specified in the build phase and the inference phase.\n\n\n##### Layer-wise profiling\n\nWhen the `--prof` option is added to `imagenet_infer`,\nit reports layer-wise execution time in the specified format as follows (`md` (markdown table) and `csv` are supported).\n\n```\n% ./build/example_imagenet/imagenet_infer/imagenet_infer -i cat.jpg -m resnet50/fp32.trt -n 1000 --prof md\n...\n| Layer name                                           | #call |   total ms |  ms/call |        % |\n|:-----------------------------------------------------|:------|:-----------|:---------|:---------|\n| ConstantInput-0                                      |  1000 |    1.23405 |  0.00123 |   0.044% |\n| Transpose-0-1                                        |  1000 |   13.49645 |  0.01350 |   0.481% |\n| Sub-1-1                                              |  1000 |    6.73158 |  0.00673 |   0.240% |\n| Convolution2DFunction-2-1 + ReLU-4-1                 |  1000 |   45.12035 |  0.04512 |   1.607% |\n| MaxPooling2D-5-1   
                                  |  1000 |   15.57731 |  0.01558 |   0.555% |\n| Convolution2DFunction-6-2 + ReLU-8-1                 |  1000 |   19.45376 |  0.01945 |   0.693% |\n| Convolution2DFunction-9-1 + ReLU-11-1                |  1000 |   26.27018 |  0.02627 |   0.936% |\n| Convolution2DFunction-12-1                           |  1000 |   26.15917 |  0.02616 |   0.932% |\n...\n```\n\n\n### Example 2: Building INT8 TensorRT engine\n\nThis section explains how to build an INT8 inference engine.\n\nIn TensorRT, post-training quantization is realized as explained in\n[8-bit Inference with TensorRT](http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf).\n\nIn order to build an INT8 mode ImageNet inference engine with `imagenet_tensorrt_builder`,\nfirst you have to prepare a list of calibration images.\n\nCalibration images are used for obtaining the layer-wise distribution of activations,\nso they have to be sampled from realistic images, preferably from the data used for training the model.\n\n```bash\n% find /path/to/ILSVRC2012/train -iname '*.jpeg' | shuf | head -n 1000 \u003e calib.list\n```\n\nThe appropriate size of the calibration set depends on the task, the model and so on, but typically on the order of 10^3 to 10^4 images is enough.\n\nIn the dump phase, you don't need anything special.\nSo now you can run the build process.\n\n```bash\n% ./build/example_imagenet/imagenet_tensorrt_builder/imagenet_tensorrt_builder \\\n    -i resnet50 -o resnet50/int8.trt --mode int8 --calib calib.list\n```\n\nThis takes a few minutes (depending on the model, the hardware and the size of the calibration set).\nAfter the build process has completed, you can run the engine in the same way as the ones built with fp32 and fp16.\n\n```bash\n% ./build/example_imagenet/imagenet_infer/imagenet_infer -i cat.jpg -m resnet50/int8.trt -n 100\nUsing GPU=0 (Name=\"GeForce GTX 1080 Ti\",CC=6.1, VRAM=11162MB)\nBatch-size = 1\nLoading model\nLoading labels\nLoading image\nSend input to GPU\nAverage inference time = 0.864\nGet 
output from GPU\n0.459342 - tiger cat\n0.321516 - tabby, tabby cat\n0.212346 - Egyptian cat\n0.001113 - lynx, catamount\n0.000704 - tiger, Panthera tigris\n```\n\n\n##### Implementing calibration stream\n\nSince calibration is task specific and model specific,\nyou will have to implement your own class to feed calibration images to the TensorRT builder,\nwhich we call a calibration stream.\n\n`imagenet_tensorrt_builder` implements a simple example of a calibration stream.\nIt receives a list of image filenames as a calibration set,\nand every time `get_batch` is called, it loads the next data to the designated buffer.\n\nA minimal skeleton looks like this.\n\n```cpp\nclass my_calib_stream : public chainer_trt::calibration_stream {\npublic:\n    my_calib_stream(...) {...}\n\n    virtual int get_n_batch() override { return number_of_samples; }\n    virtual int get_n_input() override { return number_of_inputs; }\n\n    virtual void get_batch(int i_batch, int input_idx,\n                           const std::vector\u003cint\u003e\u0026 dims,\n                           void* dst_buf_cpu) override {\n        // load input_idx-th input of i_batch-th data\n        // to dst_buf_cpu\n    }\n};\n```\n\nImageNets have only one input, so `get_batch` is called only once for each calibration data,\nbut if your network has multiple inputs, `get_batch` is called for each input for each data.\n\n\n\n##### Calibration cache\n\nThe build process with INT8 calibration is very time-consuming,\nbut the calibration information (distribution of activations) is not device-dependent,\nso TensorRT provides a way to re-use a previous calibration result to save build time.\n\n```bash\n# Build INT8 engine with a calibration cache\n% ./build/example_imagenet/imagenet_tensorrt_builder/imagenet_tensorrt_builder \\\n    -i resnet50 -o resnet50/int8.trt --mode int8 --calib calib.list \\\n    --out-cache resnet50/calib_cache.dat\n\n# Once you have made a cache, you don't need to specify calibration set, but just 
need the cache\n% ./build/example_imagenet/imagenet_tensorrt_builder/imagenet_tensorrt_builder \\\n    -i resnet50 -o resnet50/int8.trt --mode int8 --in-cache resnet50/calib_cache.dat\n```\n\nWhen implementing a calibration stream, you don't need to do anything to support the calibration cache.\n\n\n\n### Example 3: Running inference from Python\n\nchainer-trt provides a thin wrapper interface to run inference from the Python side.\n`imagenet_infer_tensorrt.py` shows a simple example of how to use this.\n\nBefore running it, please make sure that `libpyrt` is correctly built and\nvisible to the Python interpreter (see the chainer-trt build section).\n\nThe dump process and the build process are the same as in the above examples.\n\n```bash\n% python example_imagenet/imagenet_infer_tensorrt.py -m resnet50/fp32.trt -i cat.jpg\nBatch size = 1\nLoading model\nLoading labels\nLoading image\nMode: directly feed cupy array\nInference\nAverage inference time (not including CPU-\u003eGPU transfer) = 3.8590860003751004ms\n0.468369 - tiger cat\n0.308983 - tabby, tabby cat\n0.202293 - Egyptian cat\n0.003273 - lynx, catamount\n0.001592 - plastic bag\n```\n\n`libpyrt` has the following modes.\n\n**(1) Call `chainer_trt.Infer.__call__` with numpy array**\n\nchainer-trt automatically sends the data from CPU to GPU,\nruns inference, and brings the result back to CPU as `numpy.array`.\nThis is equivalent to the chainer-trt C++ interface `chainer_trt::infer::infer_from_cpu`.\n\n\n**(2) Call `chainer_trt.Infer.__call__` with cupy array**\n\nRather than `numpy.array`, you can also pass `cupy.array`.\nThe result is also `cupy.array`, so data transfer between CPU and GPU won't happen.\nThis is equivalent to the chainer-trt C++ interface `chainer_trt::infer::operator()` with raw pointers.\n\n\n**(3) Call `chainer_trt.Infer.__call__` with `chainer_trt.Buffer`**\n\nThe above (1) and (2) dynamically allocate memory several times.\nIn order to reduce this overhead, `chainer_trt.Buffer` is available,\nwhich is equivalent to 
`chainer_trt::buffer`.\n\n\n`imagenet_infer_tensorrt.py` supports these 3 modes, with `--mode {cupy|numpy|buffer}`.\n\n\n### Example 4: High-throughput inference\n\nWhen inference throughput is more important than latency,\nbatching and concurrent execution are effective (cf. [Best Practices For TensorRT Performance](https://docs.nvidia.com/deeplearning/sdk/tensorrt-best-practices/index.html)).\n`imagenet_infer_fast` is an example of such high-throughput inference.\nThis tool evaluates the classification accuracy of a model using ImageNet validation images.\n\nFirst, you need to prepare a list of validation images in `chainer.datasets.LabeledImageDataset` format, like below.\n\n```\nval/ILSVRC2012_val_00000001.JPEG 65\nval/ILSVRC2012_val_00000002.JPEG 970\nval/ILSVRC2012_val_00000003.JPEG 230\nval/ILSVRC2012_val_00000004.JPEG 809\nval/ILSVRC2012_val_00000005.JPEG 516\nval/ILSVRC2012_val_00000006.JPEG 57\nval/ILSVRC2012_val_00000007.JPEG 334\n...\n```\n\nEach line consists of a relative path to the image from a certain root directory (called `$ILSVRC2012_ROOT` here) and the ground truth label index.\nSave this list as `val.txt`.\n\nYou also need to build an inference engine with a large batch size such as 8 or 16 (please refer to the examples above).\n\n```bash\n% ./build/example_imagenet/imagenet_tensorrt_builder/imagenet_tensorrt_builder -i resnet50 -o resnet50/fp32_b8.trt -b 8\n```\n\nNow, run the high-throughput inference.\nIn this case, 8 parallel inference worker threads will run, and each worker runs inference with batch size 8.\n\n```bash\n% ./build/example_imagenet/imagenet_infer_fast/imagenet_infer_fast -m resnet50/fp32_b8.trt -i val.txt -p $ILSVRC2012_ROOT -n 8 -b 8\nUsing GPU=0 (Name=\"GeForce GTX 1080 Ti\",CC=6.1, VRAM=11162MB)\nRunning inference\ntop1 accuracy 67.148%\ntop5 accuracy 87.018%\ntotal time 53.1014s\naverage time 8.496ms/batch (1.062ms/image)\n```\n\nDuring the inference loop, you can see that the GPU is completely occupied in nvidia-smi 
command if your disk is fast enough.\n\n```bash\n% nvidia-smi\n+-----------------------------------------------------------------------------+\n| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |\n|-------------------------------+----------------------+----------------------+\n| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |\n| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |\n|===============================+======================+======================|\n|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |\n| 32%   62C    P0   255W / 250W |   3386MiB / 11162MiB |    100%      Default |\n+-------------------------------+----------------------+----------------------+\n```\n\n\n#### Implementing concurrent inference\n\nWhen implementing multi-threaded concurrent inference,\n`chainer_trt::model` can be shared among multiple workers,\nso it needs to be loaded only once at the very beginning of the program.\n\nIn contrast, `chainer_trt::infer` is *not thread safe* and *must be initialized for each worker*.\n\n\n\n#### Python-based high-throughput inference (reference implementation)\n\n`imagenet_infer_reference_eval.py` does the same thing in pure Chainer.\n\n```bash\n% python example_imagenet/imagenet_infer_reference_eval.py -m ResNet50Layers -i val.txt -p $ILSVRC2012_ROOT -n 8 -b 8\nTop1 accuracy = 67.148%\nTop5 accuracy = 87.018%\nTotal time = 94.63s\nAverage time = 15.141ms/batch, 1.893ms/image\n```\n\n\n------\n## YOLOv2 example details\n\nThere is a simple YOLOv2 object detection example in `example_yolo`.\n\nThis example is built in basically the same way as the ImageNet examples explained above.\n\n```bash\n# Dump network structure\n% python example_yolo/dump_yolo.py --gpu 0\n\n# Build an inference engine\n% tensorrt_builder -i dump_yolo -o dump_yolo/fp32.trt\n\n# Run inference\n% python example_yolo/yolo_infer_tensorrt.py --engine 
dump_yolo/fp32.trt cat.jpg -n 1000\nLoaded TensorRT inference engine dump_yolo/fp32.trt\n8.801ms/img\n\n# The detection result will also be displayed\n```\n\nIf you don't pass the `--engine` option to `yolo_infer_tensorrt.py`,\nit runs inference with the default (Chainer-based) model,\nso that you can compare the results.\n\n\n------\n## Profiling support\n\n### nvprof/NVVP profiling support\n\nIf `-DWITH_NVTX=YES` is specified when building chainer-trt,\nimproved visualization in NVVP (NVIDIA Visual Profiler) is enabled.\n\nNVTX is a CUDA API that lets you show arbitrary timelines in NVVP.\n\nBy default, without NVTX, the NVVP profiling visualization of ImageNet inference\nlooks like this.\n(Note that nvprof/NVVP can still be used even if chainer-trt is built without `-DWITH_NVTX=YES`.)\n\n![NVVP without NVTX](images/nvvp_without_nvtx.png)\n\n\n\nWhen the NVTX hook is enabled, semantically meaningful timelines are shown\nalong with CUDA kernels.\n(Note: these timelines may be offset from the timeline of CUDA kernels\nbecause of asynchronous execution.)\n\n![NVVP with NVTX](images/nvvp_with_nvtx.png)\n\n\nYou can use this to visualize your own timeline in NVVP.\n(Note that the name is `nvtx_profile`, not `chainer_trt::nvtx_profile`.)\n\n```cpp\n#include \u003cchainer_trt/profiling.hpp\u003e\n#define WITH_NVTX\n\nnvtx_profile(\"running inference\") {\n    rt(*buf);\n}\n```\n\nIf the macro `WITH_NVTX` is defined before `nvtx_profile` is used,\nthe NVTX timeline is enabled;\notherwise it is simply ignored with no overhead.\n\n\n\n### Layer-wise profiling\n\nAs already shown in the Examples sections above,\nTensorRT provides a simple mechanism to measure layer-wise execution time.\nBy using this feature, you can analyze which layers are actually time-consuming\nat the neural network level, rather than at the CUDA kernel level.\n\nYou just need 
to create an instance of `chainer_trt::default_profiler` and\npass it to the constructor of `chainer_trt::infer`.\n\n```cpp\nauto prof = std::make_shared\u003cchainer_trt::default_profiler\u003e();\nchainer_trt::infer rt(m, prof);\n\n// run inference loop\nfor(...) {\n    rt(...);\n}\n\nprof-\u003eshow_profiling_result(std::cout, \"md\");\n```\n\nDuring the inference loop, it accumulates the execution time of each layer.\nAfter the loop finishes, it shows the report in markdown table format.\n\nAn example of code and usage is shown in the Examples section.\n\n\n------\n## Debug support\n\nIf a network has issues when building an inference engine, TensorRT raises errors with the layer name.\nBut a layer name and an internal error message reported by the TensorRT runtime alone might not be informative enough\nto know what exactly is happening.\n\nThe problem here is that it can be quite difficult for users to know which part of the Python code\ncaused the trouble even if the name of the guilty layer is reported,\nbecause layer names are automatically determined by chainer-trt.\n\nTo address this issue, chainer-trt (`chainer_trt.ModelRetriever`)\nprovides the following mechanisms.\n\n\n### `MarkPrefixHook`\n\nThis hook adds a prefix to the automatically determined names of layers executed\nduring the lifetime of the hook object.\n\n```python\nx = ...\nwith chainer.using_config('train', False), chainer_trt.RetainHook():\n    with chainer_trt.MarkPrefixHook('preprocessing'):\n        x = F.transpose(x, ...)\n        x = x - mean\n    with chainer_trt.MarkPrefixHook('main'):\n        y = net(x)\nretriever(y)\n```\n\nIn `model.json` in the dump destination directory, you will see layers named like below.\n\n```\npreprocessing-Transpose-0-1\npreprocessing-Sub-1-1\nmain-cnn-Convolution2DFunction-2-1\nmain-cnn-FixedBatchNormalization-3-1\nmain-cnn-ReLU-4-1\n...\n```\n\nIf trouble is reported during the build process with only a layer name,\nyou can apply this hook to 
the suspicious part of your Python forward-pass code.\n\nAnother use case of `MarkPrefixHook` is profiling a certain part of an NN.\nSince TensorRT runs the entire NN as a black box, we cannot directly know\nhow long a certain part of the NN takes.\n\nBy adding a prefix to every layer in a certain part of an NN\nand using the **layer-wise profiling** feature explained above,\nyou can get the execution time of that part.\n\n\n### `TracebackHook`\n\nIf this hook is used, every function (`chainer.Function`) call is\nrecorded with its traceback and saved in `model.json`.\nSo if you have a layer name causing an error, you can immediately identify\nwhich part of the Python code is causing the error.\n\n```python\nx = ...\nwith chainer.using_config('train', False), chainer_trt.RetainHook():\n    with chainer_trt.TracebackHook():\n        x = F.transpose(x, ...)\n        x = x - mean\n        y = net(x)\nretriever(y)\n```\n\nIn `model.json` in the dump destination directory, you will see layers with a `\"traceback\"` field like below.\n\n```\n    {\n      \"type\": \"Transpose\",\n      \"name\": \"Transpose-0-1\",\n      \"rank\": 0,\n      \"source\": \"input-0\",\n      \"axes\": [\n        2,\n        0,\n        1\n      ],\n      \"traceback\": \"File \\\"example_imagenet/dump_chainer.py\\\", line 42, in \u003cmodule\u003e\\n    x = x.transpose((0, 3, 1, 2))   # hwc2chw\\n  File \\\"xxxx/lib/python3.6/site-packages/chainer/variable.py\\\", line 1096, in transpose\\n    return chainer.functions.transpose(self, axes)\\n  File \\\"xxxx/lib/python3.6/site-packages/chainer/functions/array/transpose.py\\\", line 72, in transpose\\n    return Transpose(axes).apply((x,))[0]\"\n    },\n```\n\n\n------\n## Development\n\n### Testing\n\nchainer-trt has tests for the Python (dump) part and the C++ part (main library).\nThe Python part is not well tested yet, but this is going to be improved.\n\n#### Python part\n\nIn the Python part (mainly `ModelRetriever`), there are some test cases on top of 
pytest.\nWhen implementing new features on the Python side, please make sure the tests pass.\n\nCurrently there are only a few tests. It's a work in progress.\n\n```bash\n% python -m pytest\n```\n\n#### C++ main library part\n\nIn the C++ part, there are quite a lot of test cases on top of the Google Test framework.\nIn order to run the tests, you have to install Google Test and build chainer-trt\nwith the `-DWITH_TESTS=YES` option.\n\n(Pitfall: on Ubuntu, `apt install libgtest-dev` only installs headers and sources,\nso you need to build it yourself.)\n\nWhen implementing new features on the C++ side, please make sure the tests pass.\n\n```bash\n% ./build/test/test_chainer_trt\n```\n\n\n### Code format\n\n#### Python part\n\nPython code is in the `python` and `example_imagenet` directories.\nRun flake8 to check the code format.\nNo warnings should be reported.\n\n```bash\n% flake8 python test example_imagenet example_yolo\n```\n\n#### C++ main library part\n\nIn the C++ part, we use `clang-format-6.0` with the project [.clang-format](.clang-format).\n\n```bash\n% clang-format-6.0 -i /path/to/cpp/file\n```\n\nC++ code (.cpp and .hpp) has to be formatted.\n\nAs for automatically generated C code (created by gengetopt),\nyou don't have to reformat it.\nFor CUDA code (.cu), since clang-format doesn't understand CUDA-specific syntax,\nyou don't have to be strict about formatting, but it is much appreciated if you\ntry clang-format and follow its advice where useful.\n\n\n\n### Micro-benchmark of plugins\n\nchainer-trt provides various custom operator implementations as\nTensorRT plugins.\nSince optimizing their performance is the chainer-trt developers' responsibility,\nthere are reproducible micro-benchmark samples.\n\nThis requires the [google benchmark library](https://github.com/google/benchmark),\nand chainer-trt built with the `-DWITH_BENCHMARK=YES` option\n(this is `OFF` by default, so you need to enable it explicitly).\n\n```\n# When running a particular benchmark case\n% ./build/benchmark/bench 
--benchmark_filter=\"shift\"\nRunning ./benchmark/bench\nRun on (8 X 4500 MHz CPU s)\nCPU Caches:\n  L1 Data 32K (x4)\n  L1 Instruction 32K (x4)\n  L2 Unified 256K (x4)\n  L3 Unified 8192K (x1)\n***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.\n***WARNING*** Library was built as DEBUG. Timings may be affected.\n------------------------------------------------------------------------------\nBenchmark                                       Time           CPU Iterations\n------------------------------------------------------------------------------\nbenchmark_shift_float/9/8/8/3/1              7455 ns       7455 ns      89274\nbenchmark_shift_float/9/8/8/3/2              7525 ns       7525 ns      94409\nbenchmark_shift_float/9/8/8/3/3              7046 ns       7046 ns      94236\nbenchmark_shift_float/25/8/8/5/1             7571 ns       7571 ns      92056\nbenchmark_shift_float/25/8/8/5/2             7206 ns       7206 ns      90150\nbenchmark_shift_float/25/8/8/5/3             7241 ns       7241 ns     102885\nbenchmark_shift_float/9/32/32/3/1            8075 ns       8075 ns      78097\nbenchmark_shift_float/9/32/32/3/2            8375 ns       8374 ns      81950\nbenchmark_shift_float/9/32/32/3/3            7754 ns       7754 ns      83699\n...\n\n% ./build/benchmark/bench\n...\n```\n\nWhen you implement a new custom plugin and it could be a crucial part\nin terms of performance, it doesn't have to be well optimized from the beginning,\nbut it is suggested to provide some benchmark cases in order to help\nfuture optimization.\n\n\n------\n## Acknowledgments\n\nThis repository includes source code of\n[picojson](https://github.com/kazuho/picojson) (Copyright 2011-2014 Kazuho Oku),\nwhich is provided in the following 2-clause BSD license.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n1. 
Redistributions of source code must retain the above copyright notice,\n   this list of conditions and the following disclaimer.\n2. Redistributions in binary form must reproduce the above copyright notice,\n   this list of conditions and the following disclaimer in the documentation\n   and/or other materials provided with the distribution.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpfnet-research%2Fchainer-trt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpfnet-research%2Fchainer-trt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpfnet-research%2Fchainer-trt/lists"}