{"id":28100721,"url":"https://github.com/mit-han-lab/inter-operator-scheduler","last_synced_at":"2025-09-02T15:14:10.093Z","repository":{"id":40997919,"uuid":"310002257","full_name":"mit-han-lab/inter-operator-scheduler","owner":"mit-han-lab","description":"[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration","archived":false,"fork":false,"pushed_at":"2022-04-27T20:35:11.000Z","size":3283,"stargazers_count":181,"open_issues_count":0,"forks_count":29,"subscribers_count":9,"default_branch":"master","last_synced_at":"2023-11-07T18:24:32.405Z","etag":null,"topics":["acceleration","cnn","inference-optimization","parallelism"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2011.01302","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mit-han-lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-11-04T12:53:28.000Z","updated_at":"2023-11-07T07:00:51.000Z","dependencies_parsed_at":"2022-08-02T18:00:19.932Z","dependency_job_id":null,"html_url":"https://github.com/mit-han-lab/inter-operator-scheduler","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Finter-operator-scheduler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Finter-operator-scheduler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Finter-operator-scheduler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mit-han-lab%2Finter-operator-scheduler/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mit-han-lab","download_url":"https://codeload.github.com/mit-han-lab/inter-operator-scheduler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254004847,"owners_count":21998138,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["acceleration","cnn","inference-optimization","parallelism"],"created_at":"2025-05-13T18:38:29.307Z","updated_at":"2025-05-13T18:38:29.858Z","avatar_url":"https://github.com/mit-han-lab.png","language":"C++","funding_links":[],"categories":["C++"],"sub_categories":[],"readme":"# IOS: Inter-Operator Scheduler for CNN Acceleration [[arXiv]](https://arxiv.org/abs/2011.01302)[[Website]](http://www.yaoyaoding.com/ios/)\n\n\n* [1. Methodology](#1-methodology)\n* [2. Installation](#2-installation)\n  + [2.1 Prerequisites](#21-prerequisites)\n  + [2.2 Build IOS runtime](#22-build-ios-runtime)\n  + [2.3 Install IOS python package](#23-install-ios-python-package)\n* [3. Usage](#3-usage)\n* [4. 
Experiments](#4-experiments)\n  + [4.1 Experiment Environment Setup](#41-experiment-environment-setup)\n    - [4.1.1 Install TensorRT runtime in IOS](#411-install-tensorrt-runtime-in-ios)\n    - [4.1.2 Install TVM](#412-install-tvm)\n    - [4.1.3 Install TASO](#413-install-taso)\n    - [4.1.4 Install Tensorflow](#414-install-tensorflow)\n    - [4.1.5 Install PyTorch](#415-install-pytorch)\n    - [4.1.5 Lock GPU Clock Rate](#415-lock-gpu-clock-rate)\n  + [4.2 Experiments and ablation study](#42-experiments-and-ablation-study)\n    - [4.2.1 Comparison of Different Schedules](#421-comparison-of-different-schedules)\n    - [4.2.2 Comparison of cuDNN-based Frameworks](#422-comparison-of-cudnn-based-frameworks)\n    - [4.2.3 Utilization Profiling](#423-utilization-profiling)\n    - [4.2.4 Specialized Scheduling is Beneficial](#424-specialized-scheduling-is-beneficial)\n    - [4.2.5 Schedule Pruning Reduces Search Time](#425-schedule-pruning-reduces-search-time)\n    - [4.2.6 Consistent Improvement for Different Batch Sizes](#426-consistent-improvement-for-different-batch-sizes)\n    - [4.2.7 Intra- and Inter-Operator Parallelism](#427-intra--and-inter-operator-parallelism)\n\nTo accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization.\nHowever, a single operator can no longer fully utilize the available parallelism given the rapid advances in high-performance hardware, \nresulting in a large gap between the peak performance and the real performance. \nThis performance gap is more severe under smaller batch sizes.  \nIn this work, we extensively study the parallelism between operators and propose the Inter-Operator Scheduler (IOS) to automatically schedule the execution of multiple operators in parallel. \nIOS utilizes dynamic programming to find a scheduling policy specialized for the target hardware. 
\nIOS consistently outperforms state-of-the-art libraries (e.g., TensorRT) by 1.1 to 1.5x on modern CNN benchmarks.\n\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/frameworks_comparison.png\" width=600\u003e\n  \n  End-to-end performance comparison of different frameworks across different CNNs on batch size one. \n  The throughput is normalized to the best one for each model.\n\u003c/div\u003e\n\n## 1. Methodology\n\nIOS partitions a given computation graph into multiple \u003cem\u003estages\u003c/em\u003e. Each stage has a \u003cem\u003eparallelization strategy\u003c/em\u003e. \n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"./figures/schedule_example.png\" width=400\u003e\n\u003c/div\u003e\n\nAs shown in the figure above, the computation graph in (1) is partitioned into two stages in (2). \nThe first stage contains operators a and b, and the second stage contains operators c, d, and e. \nThe first stage merges the two convolutions, and the second stage concurrently executes the \u003cem\u003eindependent\u003c/em\u003e groups of operators.\nSuch a partition, together with the parallelization strategy for each of its stages, is called a \u003cem\u003eschedule\u003c/em\u003e for the computation graph in IOS.\n\nThe number of feasible schedules for a computation graph grows exponentially with the number of operators in the graph. \nIt is challenging to find a highly optimized schedule for a given computation graph within a reasonable time. \nIOS takes advantage of the sub-schedules shared among different schedules and uses dynamic programming to find a highly optimized schedule for a given computation graph.\nFor more details, please refer to the Methods section in our paper.\n\n\n## 2. 
Installation \n\nPlease follow this section to build IOS from source code.\n\n### 2.1 Prerequisites\n\n- CMake 3.10 or higher \n- [CUDA Toolkit](https://developer.nvidia.com/cuda-toolkit) 10.0 or higher\n- [cuDNN](https://developer.nvidia.com/cudnn) 7.6.5 or higher\n\n### 2.2 Build IOS runtime\nTo get started, clone the IOS source code from GitHub.\n```shell script\ngit clone https://github.com/mit-han-lab/inter-operator-scheduler.git ios\ncd ios\n```\nThen build the IOS runtime:\n```shell script\nmkdir build\ncd build; \ncmake ..; make -j4\ncd ..\n```\n\n### 2.3 Install IOS python package\nOnce the IOS runtime has been built, run the following commands to install the IOS Python package.\n```shell script\ncd python; \npython setup.py install --user\n```\n\n\n## 3. Usage \nIOS optimizes a user-defined computation graph and runs inference on the IOS runtime. The following code snippet shows how to use IOS, in which the user \n1. defines the computation graph first,\n2. then optimizes the execution schedule,\n3. 
and executes the network on IOS runtime at last.\n\n```python\nimport numpy as np\nimport ios\n\ndef sample_network():\n    v = ios.placeholder(output_shape=(375, 15, 15))\n    block = ios.Block(enter_node=v.node)\n    v1 = ios.conv2d(block, inputs=[[v]], out_channels=375, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')\n    v2 = ios.conv2d(block, inputs=[[v]], out_channels=750, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')\n    v3 = ios.conv2d(block, inputs=[[v]], out_channels=375, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')\n    v1 = ios.conv2d(block, inputs=[[v1]], out_channels=750, kernel=(3, 3), stride=(1, 1), padding=(1, 1), act='relu')\n    out = ios.identity(block, inputs=[[v1], [v2], [v3]], is_exit=True)  # concat v1, v2, and v3\n    graph = ios.Graph(name=\"demo\", input=v.node, blocks=[block])\n    graph.init_weights()\n    return graph\n\n# define computation graph\ngraph = sample_network()\n\n# optimize execution schedule\noptimized_graph = ios.optimize(graph, batch_size=1, opt_type='dp_parallel', compute_weight=True)\n\n# measure latency\ngraph.sequential_schedule()\nseq_latency, stage_latency = ios.ios_runtime.graph_latency(graph, batch_size=1, repeat=6, profile_stage=True)\nprint(graph)\nprint(f'Sequential schedule: {np.mean(seq_latency):.3f} ms')\nprint(f'      Stage latency: {np.mean(np.array(stage_latency).reshape(6, -1), axis=0)}\\n')\n\nopt_latency, stage_latency = ios.ios_runtime.graph_latency(optimized_graph, batch_size=1, repeat=6, profile_stage=True)\nprint(optimized_graph)\nprint(f'Optimized schedule: {np.mean(opt_latency):.3f} ms')\nprint(f'     Stage latency: {np.mean(np.array(stage_latency).reshape(6, -1), axis=0)}')\n\n# inference on ios runtime\ndummy_inputs = np.random.randn(1, 375, 15, 15)\noutput = ios.ios_runtime.graph_inference(optimized_graph, batch_size=1, input=dummy_inputs)\n```\nAn output of this program:\n```text\nSequential(\n  [1]Conv2d(0)\n  [2]Conv2d(0)\n  [3]Conv2d(0)\n  
[4]Conv2d(1)\n  [5]Concat(4,2,3)\n)\nSequential schedule: 0.486 ms\n      Stage latency: [0.11070578 0.12603733 0.10604089 0.12549689 0.01794844]\n\nSequential(\n  Parallel(\n    [1]Conv2d(0)\n    [2]Conv2d(0)\n  )\n  Parallel(\n    [4]Conv2d(1)\n    [3]Conv2d(0)\n  )\n  [5]Concat(4,2,3)\n)\nOptimized schedule: 0.333 ms\n     Stage latency: [0.16145067 0.15448178 0.01732267]\n```\nThe following figure shows the sequential schedule and the IOS-optimized schedule of the sample network defined above. \n\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"./figures/sample.png\" width=600\u003e\n\u003c/div\u003e\n\n## 4. Experiments\nThe following parts show the commands to reproduce all experiments and ablation studies. \n\n### 4.1 Experiment Environment Setup\n\nIn these experiments, we compare IOS with the following frameworks:\n\n- [TensorRT](https://developer.nvidia.com/tensorrt)\n- [TVM](https://docs.tvm.ai/install/from_source.html)\n- [TASO](https://github.com/jiazhihao/TASO)\n- [Tensorflow](https://www.tensorflow.org/)\n- [PyTorch](https://pytorch.org/)\n\nAll experiments are conducted under the following environment.\n- Python 3.7\n- NVIDIA Driver 450.51.05\n- CUDA Toolkit 10.2\n- CUDNN 7.6.5\n- TensorRT 7.0.0.11\n- TVM v0.6\n- TASO v1.0\n- Tensorflow 2.3\n- PyTorch 1.6.0\n\nThe prerequisites for each experiment (from 1 to 7) are:\n- Experiments 1, 3, 4, and 5 do not require any other frameworks/libraries\n- Experiment 2 requires TensorRT, TVM, TASO, Tensorflow, and PyTorch (you can skip any of them if you do not want to compare IOS with it)\n- Experiment 6 requires TensorRT (you can skip it if you only compare the Sequential schedule and the IOS optimized schedule)\n- Experiment 7 requires TVM\n\nWe recommend reproducing the experiments in a conda environment:\n```shell script\nconda create -n ios python=3.7\nconda activate ios\n```\n\n#### 4.1.1 Install TensorRT runtime in IOS\n1. Download [TensorRT](https://developer.nvidia.com/tensorrt) from the NVIDIA website. 
We recommend downloading the tar archive.\n2. Extract the TensorRT archive to a directory of your choice. Please use the tar.gz file you downloaded.\n   ```shell script\n   tar xvzf ~/Downloads/TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar.gz -C /path/to/unarchive\n   ```\n3. Configure the `config.cmake` file in the IOS root directory. Change `set(USE_TRT OFF)` to `set(USE_TRT /path/to/unarchive/TensorRT-7.0.0.11)`.\n4. Rebuild the IOS runtime and TRT runtime, and reinstall the IOS Python package:\n   ```shell script\n   cd /path/to/ios; \n   mkdir -p build; cd build; cmake ..; make -j4; cd ..\n   cd python; python setup.py install; cd ..\n   ```\n5. Add `/path/to/tensorrt/lib` to the end of `LD_LIBRARY_PATH`.\n   ```shell script\n   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/path/to/TensorRT-x.x.x.x/lib\n   ```\n\nThis completes the installation of the TensorRT runtime in IOS. We can now run inference on an IOS computation graph and measure its latency with the `ios.trt_runtime` module as follows:\n```python\nimport numpy as np\nimport ios\ngraph = ios.models.inception_v3()\n# measure latency\nlatency = ios.trt_runtime.graph_latency(graph, batch_size=1, repeat=5)\n# inference\noutputs = ios.trt_runtime.graph_inference(graph, batch_size=1, input=np.random.randn(1, 3, 299, 299))\n```\nThe `ios.trt_runtime` module converts the IOS computation graph `ios.Graph` to the corresponding TensorRT network, then measures the latency and executes the network using the TensorRT library.\n\n#### 4.1.2 Install TVM\nPlease refer to the [TVM installation guide](https://tvm.apache.org/docs/install/from_source.html) for the instructions to install TVM. \nBecause we need to customize the installation configuration (step 3 below), we put the installation commands here for simplicity. \n1. 
Clone the TVM source code from GitHub.\n   ```shell script\n   git clone https://github.com/apache/incubator-tvm.git tvm\n   cd tvm; \n   git checkout v0.6  # you can change v0.6 to v0.7 or higher to use a newer version of TVM\n   git submodule update --recursive --init\n   mkdir build; cp cmake/config.cmake build; \n   ```\n2. Install `llvm` by `sudo apt install llvm`.\n3. Configure `build/config.cmake`.\n   1. Replace `set(USE_CUDA OFF)` with `set(USE_CUDA ON)` or `set(USE_CUDA /path/to/a/specific/cuda_toolkit)`.\n   2. Replace `set(USE_CUDNN OFF)` with `set(USE_CUDNN ON)`.\n   3. Replace `set(USE_LLVM OFF)` with `set(USE_LLVM ON)`. \n4. Build and install TVM.\n   ```shell script\n   cd build; cmake ..; make -j8; cd ..;\n   cd python; python setup.py install --user; cd ..;\n   cd topi/python; python setup.py install --user; cd ../.. # for tvm v0.6, ignore for tvm v0.7 or higher\n   ```\n5. Validate that you have successfully installed TVM by\n   ```python\n   import tvm\n   print(tvm.__version__)\n   ```\n\n#### 4.1.3 Install TASO\nPlease refer to the [TASO installation guide](https://github.com/jiazhihao/TASO/blob/master/INSTALL.md) for the instructions to install TASO. \n\n#### 4.1.4 Install Tensorflow\n```shell script\npip install tensorflow\n```\n\n#### 4.1.5 Install PyTorch\n```shell script\nconda install pytorch torchvision -c pytorch\n```\n\n\n#### 4.1.5 Lock GPU Clock Rate\nModern GPUs can adjust their clock rate dynamically to reduce energy consumption when the device is not busy, \nso we lock the clock rate to make the experiment results more accurate and consistent.\nBefore conducting the experiments, run the following command (requires sudo privileges).\n```shell script\nsudo nvidia-smi --lock-gpu-clocks=MIN_CLOCK,MAX_CLOCK\n```\nThis command locks the GPU clocks to the specified range `[MIN_CLOCK, MAX_CLOCK]`. \nIn our experiments, we set both `MIN_CLOCK` and `MAX_CLOCK` to 1530, \nwhich is the maximum clock rate NVIDIA Tesla V100 SXM2 supports. 
\nYou can use the following command to query the clock rates supported by your NVIDIA GPU,\n```shell script\nnvidia-smi --query --display=SUPPORTED_CLOCKS\n```\nand use this command to watch the current GPU clock rate:\n```shell script\nwatch nvidia-smi --query --display=CLOCK\n```\nAfter the experiments, you can run the following command to reset your GPU clock:\n```shell script\nsudo nvidia-smi --reset-gpu-clocks\n```\nRefer to [this guide](https://www.microway.com/hpc-tech-tips/nvidia-smi_control-your-gpus/) and `man nvidia-smi` for more information.\n\n### 4.2 Experiments and ablation study\nOnce the experiment environment has been set up, we can conduct the 7 experiments and ablation studies in the paper. \nAll the experiment results in the paper (shown in the figures) are the average of five repeated runs. \nTo save time, the code in this section runs each experiment only \u003cem\u003eonce\u003c/em\u003e. \nThe differences between this output and the numbers in the paper are within the allowable error range.\n\n#### 4.2.1 Comparison of Different Schedules\n\nThis experiment compares the following schedules: Sequential, Greedy, IOS-Merge, IOS-Parallel, and IOS-Both. \nFor a fair comparison, all schedules are executed in the same execution engine (IOS runtime).\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/schedules.png\" width=600\u003e\n  \n  End-to-end performance comparison of different schedules across different CNNs on batch size one. 
\n  The throughput is normalized to the best one for each model.\n\u003c/div\u003e\n\nThe following table gives the latency (ms) for each model and schedule.\n\n|   Schedule   | Sequential | Greedy | IOS-Merge | IOS-Parallel | IOS-Both |\n|:------------:|:----------:|:------:|:---------:|:------------:|:--------:|\n| Inception V3 |    6.51    |  4.62  |    5.39   |     4.11     |   4.03   |\n|   RandWire   |    8.49    |  6.27  |    8.54   |     6.02     |   6.02   |\n|    NasNet    |    22.95   |  16.78 |   22.94   |     16.04    |   16.04  |\n|  SqueezeNet  |    0.86    |  0.98  |    0.74   |     0.82     |   0.73   |\n|    GeoMean   |    5.74    |  4.67  |    5.28   |     4.24     |   4.11   |\n\nCommand:\n```shell script\ncd experiments/latency; sh run_expr_schedules.sh; cd ../..\n```\nKey output:\n```text\nModel: inception_v3 | Optimization: Sequential      | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 6.25 ms\nModel: inception_v3 | Optimization: Greedy          | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 4.62 ms\nModel: inception_v3 | Optimization: IOS-Merge       | Batchsize: 1  | Optimization cost: 1 sec    | Latency: 5.13 ms\nModel: inception_v3 | Optimization: IOS-Parallel    | Batchsize: 1  | Optimization cost: 48 sec   | Latency: 4.06 ms\nModel: inception_v3 | Optimization: IOS-Both        | Batchsize: 1  | Optimization cost: 48 sec   | Latency: 3.94 ms\nModel: randwire     | Optimization: Sequential      | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 8.53 ms\nModel: randwire     | Optimization: Greedy          | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 6.27 ms\nModel: randwire     | Optimization: IOS-Merge       | Batchsize: 1  | Optimization cost: 3 sec    | Latency: 8.58 ms\nModel: randwire     | Optimization: IOS-Parallel    | Batchsize: 1  | Optimization cost: 4386 sec | Latency: 5.80 ms\nModel: randwire     | Optimization: IOS-Both        | Batchsize: 1  | Optimization cost: 4407 sec | Latency: 5.78 
ms\nModel: nasnet       | Optimization: Sequential      | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 23.02 ms\nModel: nasnet       | Optimization: Greedy          | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 16.60 ms\nModel: nasnet       | Optimization: IOS-Merge       | Batchsize: 1  | Optimization cost: 63 sec   | Latency: 23.06 ms\nModel: nasnet       | Optimization: IOS-Parallel    | Batchsize: 1  | Optimization cost: 3591 sec | Latency: 15.87 ms\nModel: nasnet       | Optimization: IOS-Both        | Batchsize: 1  | Optimization cost: 3653 sec | Latency: 15.85 ms\nModel: squeezenet   | Optimization: Sequential      | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 0.89 ms\nModel: squeezenet   | Optimization: Greedy          | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 0.98 ms\nModel: squeezenet   | Optimization: IOS-Merge       | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 0.68 ms\nModel: squeezenet   | Optimization: IOS-Parallel    | Batchsize: 1  | Optimization cost: 1 sec    | Latency: 0.86 ms\nModel: squeezenet   | Optimization: IOS-Both        | Batchsize: 1  | Optimization cost: 1 sec    | Latency: 0.68 ms\n```\n\n#### 4.2.2 Comparison of cuDNN-based Frameworks\n\nThis experiment compares IOS with other cuDNN-based frameworks/libraries: Tensorflow, TVM-cuDNN, TASO, and TensorRT. \nTVM-cuDNN is the TVM framework with convolutions dispatched to the cuDNN kernels (`target = 'cuda -libs=cudnn'`). \n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/frameworks_comparison.png\" width=600\u003e\n  \n  End-to-end performance comparison of different frameworks across different CNNs on batch size one. 
\n  The throughput is normalized to the best one for each model.\n\u003c/div\u003e\n\nThe following table gives the latency (ms) for each model and framework/library.\n\n|  Frameworks  | Tensorflow | Tensorflow-XLA | TASO  | TVM-cuDNN | TensorRT |  IOS  |\n|:------------:|:----------:|:--------------:|:-----:|:---------:|:--------:|:-----:|\n| Inception V3 |    7.95    |      9.95      |  5.70 |    4.88   |   5.21   |  4.03 |\n|   RandWire   |    12.06   |      16.61     |  8.42 |    6.86   |   8.33   |  6.02 |\n|    NasNet    |    24.73   |      34.66     | 21.29 |   26.87   |   24.66  | 16.04 |\n|  SqueezeNet  |    2.63    |      4.08      |  0.82 |    0.90   |   0.80   |  0.73 |\n|    GeoMean   |    8.88    |      12.36     |  5.37 |    5.54   |   5.41   |  4.11 |\n\nCommand:\n```shell script\ncd experiments/latency; sh run_expr_frameworks.sh; cd ../..\n```\n\nKey output:\n```text\nModel: inception_v3 | Optimization: Tensorflow      | Batchsize: 1  | Optimization cost: 4 sec    | Latency: 7.70 ms\nModel: inception_v3 | Optimization: Tensorflow-XLA  | Batchsize: 1  | Optimization cost: 6 sec    | Latency: 9.37 ms\nModel: inception_v3 | Optimization: TASO            | Batchsize: 1  | Optimization cost: 50 sec   | Latency: 5.47 ms\nModel: inception_v3 | Optimization: TVM-cuDNN       | Batchsize: 1  | Optimization cost: 29 sec   | Latency: 4.88 ms\nModel: inception_v3 | Optimization: TensorRT        | Batchsize: 1  | Optimization cost: 17 sec   | Latency: 4.77 ms\nModel: randwire     | Optimization: Tensorflow      | Batchsize: 1  | Optimization cost: 5 sec    | Latency: 11.31 ms\nModel: randwire     | Optimization: Tensorflow-XLA  | Batchsize: 1  | Optimization cost: 12 sec   | Latency: 14.86 ms\nModel: randwire     | Optimization: TASO            | Batchsize: 1  | Optimization cost: 5222 sec | Latency: 8.65 ms\nModel: randwire     | Optimization: TVM-cuDNN       | Batchsize: 1  | Optimization cost: 28 sec   | Latency: 6.82 ms\nModel: randwire     | Optimization: 
TensorRT        | Batchsize: 1  | Optimization cost: 108 sec  | Latency: 7.93 ms\nModel: nasnet       | Optimization: Tensorflow      | Batchsize: 1  | Optimization cost: 8 sec    | Latency: 24.14 ms\nModel: nasnet       | Optimization: Tensorflow-XLA  | Batchsize: 1  | Optimization cost: 19 sec   | Latency: 32.47 ms\nModel: nasnet       | Optimization: TASO            | Batchsize: 1  | Optimization cost: 36 sec   | Latency: 21.26 ms\nModel: nasnet       | Optimization: TVM-cuDNN       | Batchsize: 1  | Optimization cost: 54 sec   | Latency: 26.83 ms\nModel: nasnet       | Optimization: TensorRT        | Batchsize: 1  | Optimization cost: 246 sec  | Latency: 24.38 ms\nModel: squeezenet   | Optimization: Tensorflow      | Batchsize: 1  | Optimization cost: 2 sec    | Latency: 2.59 ms\nModel: squeezenet   | Optimization: Tensorflow-XLA  | Batchsize: 1  | Optimization cost: 4 sec    | Latency: 3.71 ms\nModel: squeezenet   | Optimization: TASO            | Batchsize: 1  | Optimization cost: 3 sec    | Latency: 0.82 ms\nModel: squeezenet   | Optimization: TVM-cuDNN       | Batchsize: 1  | Optimization cost: 11 sec   | Latency: 0.88 ms\nModel: squeezenet   | Optimization: TensorRT        | Batchsize: 1  | Optimization cost: 8 sec    | Latency: 0.81 ms\n```\n\n#### 4.2.3 Utilization Profiling\nThis experiment profiles the active warps of sample network defined in [Usage](#3-usage) under Sequential schedule and IOS-Both schedule. \nThe NVIDIA CUDA Profiling Tools Interface ([CUPTI](https://developer.nvidia.com/cupti)) is used to profile. \n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/utilization.png\" width=600\u003e\n  \n  The profiling of active warps for the sample network defined in `experiments/sample.py`. 
\n  \u003ca href=\"https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/issueefficiency.htm\"\u003eActive warps\u003c/a\u003e \n  indicate the number of actually executed instructions (1 warp = 32 inst.) on the device and can be used to show the device utilization. \n  There is about 2.1 ms between two timestamps on average. \n  IOS achieves higher device utilization (active warps/ms) than the sequential schedule.\n\u003c/div\u003e\n\nCommand:\n```shell script\ncd experiments/utilization; sh run_expr_utilization.sh; cd ../..\n```\n\nThe above command generates a plot image named `active_warps.png`, which reflects the real device utilization.\nHere is a sample of the figure:\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/active_warps.png\" width=500\u003e\n\u003c/div\u003e\n\n\n#### 4.2.4 Specialized Scheduling is Beneficial\n\nIOS supports specialized scheduling for different devices and different batch sizes. \n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/specialization.png\" width=600\u003e\n  \n  Latency (ms) of specialized schedules for batch size 1, 32 and 128, and specialized schedules for NVIDIA Tesla K80 and V100. \n  The best performance is achieved when the schedule is specialized for each batch size and device. \n  Each row is the batch size or device that the model is executed on. \n  Each column is the batch size or device that IOS optimized for. \n  Inception V3 is used as the benchmark.\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/specialization_example.png\" width=600\u003e\n  \n  The schedule found by IOS for the last block of Inception V3. \n  Operators a-e are convolution operators, while operator P is a pooling operator. \n  Schedules (1) and (2) are optimized for batch sizes 1 and 32, respectively. \n  Schedule (1) has two stages, while schedule (2) has four stages. 
\n  Schedule (1) is 28% faster than schedule (2) on batch size 1. \n  Schedule (2) is 8% faster than schedule (1) on batch size 32.\n\u003c/div\u003e\n\nWe first optimize for different batch sizes (1, 32, and 128) to get a schedule specialized for each batch size (for your convenience, we have put the schedules we obtained in the `schedules` directory). \nThen we execute Inception V3 with each batch size under each specialized schedule (there are 9 combinations, 3 by 3). \n\nTo explore the specialization for different batch sizes, run the following command:\n```shell script\ncd experiments/specialization; sh run_expr_spec_batchsize.sh; cd ../..\n```\n\nKey output:\n```text\nOptimized for BS 1    Execute with BS 1    Latency: 4.04 ms\nOptimized for BS 1    Execute with BS 32   Latency: 29.21 ms\nOptimized for BS 1    Execute with BS 128  Latency: 105.87 ms\nOptimized for BS 32   Execute with BS 1    Latency: 4.45 ms\nOptimized for BS 32   Execute with BS 32   Latency: 27.62 ms\nOptimized for BS 32   Execute with BS 128  Latency: 103.58 ms\nOptimized for BS 128  Execute with BS 1    Latency: 4.58 ms\nOptimized for BS 128  Execute with BS 32   Latency: 27.85 ms\nOptimized for BS 128  Execute with BS 128  Latency: 102.96 ms\n```\n\nTo explore the specialization for different devices, we need a second GPU device. In our experiment, we use an NVIDIA Tesla K80 as the second device.\nWe first optimize the network on each device to get the specialized schedules (we also put them in the `schedules` directory). 
\nThen we execute the network with different specialized schedules on the two devices (there are 4 combinations, 2 by 2).\n\nRun the following commands on NVIDIA Tesla V100 and K80 with `DEVICE=v100` and `DEVICE=k80`, respectively.\n```shell script\ncd experiments/specialization; sh run_expr_spec_device.sh DEVICE; cd ../..\n```\n\nKey output log when executed on V100 and `DEVICE=v100`:\n```text\nRun on v100\nOptimized for k80   Execute with v100  Latency: 4.42 ms\nOptimized for v100  Execute with v100  Latency: 4.02 ms\n```\n\nKey output log when executed on K80 and `DEVICE=k80`:\n```text\nRun on k80\nOptimized for k80   Execute with k80   Latency: 13.93 ms\nOptimized for v100  Execute with k80   Latency: 14.64 ms\n```\n(Because the NVIDIA Tesla K80 cannot lock the GPU clock, you need to warm up the GPU so that it runs at its highest clock rate to get the above result.)\n\nThese experiments show that specialized scheduling is beneficial.\n\n#### 4.2.5 Schedule Pruning Reduces Search Time\n\nTo let users trade off search time against the latency of the optimized schedule, we introduce a schedule pruning strategy that reduces the search time. 
\nThis experiment shows the trade-off between the search time and schedule latency.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/reduce_optimization_cost.png\" width=500\u003e\n  \n  Trade-off between the optimized latency and the optimization cost for Inception V3 and NasNet.\n\u003c/div\u003e\n\nCommand:\n```shell script\ncd experiments/latency; sh run_expr_prune.sh; cd ../..\n```\n\nKey output:\n```text\nModel: inception_v3 | Optimization: IOS-Both(r=1, s=3)   | Batchsize: 1  | Optimization cost: 5 sec    | Latency: 4.22 ms\nModel: inception_v3 | Optimization: IOS-Both(r=1, s=8)   | Batchsize: 1  | Optimization cost: 7 sec    | Latency: 4.06 ms\nModel: inception_v3 | Optimization: IOS-Both(r=2, s=3)   | Batchsize: 1  | Optimization cost: 17 sec   | Latency: 4.02 ms\nModel: inception_v3 | Optimization: IOS-Both(r=2, s=8)   | Batchsize: 1  | Optimization cost: 25 sec   | Latency: 3.99 ms\nModel: inception_v3 | Optimization: IOS-Both(r=3, s=3)   | Batchsize: 1  | Optimization cost: 29 sec   | Latency: 3.99 ms\nModel: inception_v3 | Optimization: IOS-Both(r=3, s=8)   | Batchsize: 1  | Optimization cost: 43 sec   | Latency: 3.96 ms\nModel: nasnet       | Optimization: IOS-Both(r=1, s=3)   | Batchsize: 1  | Optimization cost: 137 sec  | Latency: 17.54 ms\nModel: nasnet       | Optimization: IOS-Both(r=1, s=8)   | Batchsize: 1  | Optimization cost: 492 sec  | Latency: 16.54 ms\nModel: nasnet       | Optimization: IOS-Both(r=2, s=3)   | Batchsize: 1  | Optimization cost: 360 sec  | Latency: 16.85 ms\nModel: nasnet       | Optimization: IOS-Both(r=2, s=8)   | Batchsize: 1  | Optimization cost: 2648 sec | Latency: 16.09 ms\nModel: nasnet       | Optimization: IOS-Both(r=3, s=3)   | Batchsize: 1  | Optimization cost: 641 sec  | Latency: 16.73 ms\nModel: nasnet       | Optimization: IOS-Both(r=3, s=8)   | Batchsize: 1  | Optimization cost: 3412 sec | Latency: 15.91 ms\n```\n\n#### 4.2.6 Consistent Improvement for Different Batch Sizes\n\nIOS can achieve 
consistent improvement for different batch sizes. In this experiment, we measure the latency of Inception V3 on batch sizes 1, 16, 32, 64, and 128. \nThe experiment results show that IOS consistently outperforms TensorRT in terms of throughput.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/large_batchsize.png\" width=600\u003e\n  \n  The throughput comparison of Sequential schedule, TensorRT and IOS on batch size 1, 16, 32, 64 and 128 for Inception V3. \n\u003c/div\u003e\n\nCommand:\n```shell script\ncd experiments/latency; sh run_expr_batchsize.sh; cd ../..\n```\n\nKey output:\n```text\nModel: inception_v3 | Optimization: Sequential      | Batchsize: 1  | Optimization cost: 0 sec    | Latency: 6.20 ms\nModel: inception_v3 | Optimization: TensorRT        | Batchsize: 1  | Optimization cost: 17 sec   | Latency: 4.82 ms\nModel: inception_v3 | Optimization: IOS-Both        | Batchsize: 1  | Optimization cost: 48 sec   | Latency: 3.94 ms\nModel: inception_v3 | Optimization: Sequential      | Batchsize: 16 | Optimization cost: 0 sec    | Latency: 17.95 ms\nModel: inception_v3 | Optimization: TensorRT        | Batchsize: 16 | Optimization cost: 8 sec    | Latency: 17.82 ms\nModel: inception_v3 | Optimization: IOS-Both        | Batchsize: 16 | Optimization cost: 131 sec  | Latency: 15.17 ms\nModel: inception_v3 | Optimization: Sequential      | Batchsize: 32 | Optimization cost: 0 sec    | Latency: 30.54 ms\nModel: inception_v3 | Optimization: TensorRT        | Batchsize: 32 | Optimization cost: 9 sec    | Latency: 29.97 ms\nModel: inception_v3 | Optimization: IOS-Both        | Batchsize: 32 | Optimization cost: 207 sec  | Latency: 27.00 ms\nModel: inception_v3 | Optimization: Sequential      | Batchsize: 64 | Optimization cost: 0 sec    | Latency: 55.67 ms\nModel: inception_v3 | Optimization: TensorRT        | Batchsize: 64 | Optimization cost: 11 sec   | Latency: 56.25 ms\nModel: inception_v3 | Optimization: IOS-Both        | Batchsize: 64 | Optimization cost: 
368 sec  | Latency: 51.11 ms\nModel: inception_v3 | Optimization: Sequential      | Batchsize: 128 | Optimization cost: 0 sec    | Latency: 108.55 ms\nModel: inception_v3 | Optimization: TensorRT        | Batchsize: 128 | Optimization cost: 16 sec   | Latency: 106.84 ms\nModel: inception_v3 | Optimization: IOS-Both        | Batchsize: 128 | Optimization cost: 711 sec  | Latency: 102.74 ms\n```\n\n#### 4.2.7 Intra- and Inter-Operator Parallelism\n\nAutoTVM specializes in improving kernel efficiency by searching for a highly optimized schedule for each kernel. \nIOS is currently implemented on top of the vendor-provided library cuDNN. \nWe compare the two to gain more insight into intra- and inter-operator parallelism.\nBecause AutoTVM tuning is time-consuming (it takes 26 hours on an 8-V100 server to optimize the four benchmark networks), we provide our tuned schedule configs in the `tvm_schedule_configs` directory. \nYou can use these configs directly to reproduce the experiments.\nPlease note that these schedule configs are optimized for NVIDIA Tesla V100 SXM2 with driver 450.51.05 and CUDA toolkit 10.2 using TVM v0.6. \nIf you want to tune the networks yourself, delete the `./schedules` directory; the networks will then be tuned using TVM and the tuned schedule configs stored in `./tvm_schedule_configs` automatically.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"./figures/autotvm.png\" width=600\u003e\n  \n  End-to-end performance comparison between TVM-AutoTune and IOS. \n  TVM-AutoTune and IOS are orthogonal because TVM focuses on intra-operator parallelism while IOS focuses on inter-operator parallelism. \n  They can be combined to further boost inference performance. 
\n  The optimization cost of IOS is two orders of magnitude lower than that of TVM.\n\u003c/div\u003e\n\nCommand:\n```shell script\ncd experiments/latency; sh run_expr_autotvm.sh; cd ../..\n```\n\nKey output:\n```text\nModel: inception_v3 | Optimization: TVM-AutoTune    | Batchsize: 1  | Optimization cost: 21 sec   | Latency: 4.95 ms\nModel: randwire     | Optimization: TVM-AutoTune    | Batchsize: 1  | Optimization cost: 26 sec   | Latency: 5.26 ms\nModel: nasnet       | Optimization: TVM-AutoTune    | Batchsize: 1  | Optimization cost: 28 sec   | Latency: 14.67 ms\nModel: squeezenet   | Optimization: TVM-AutoTune    | Batchsize: 1  | Optimization cost: 13 sec   | Latency: 0.75 ms\n```\n(The `Optimization cost` shown in the output is the time used to compile the network and measure latency. It does not include the auto-tuning time, because the pre-tuned configs are used.\nTuning the four networks takes about 26 hours on an 8-V100 server.)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmit-han-lab%2Finter-operator-scheduler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmit-han-lab%2Finter-operator-scheduler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmit-han-lab%2Finter-operator-scheduler/lists"}