{"id":13699996,"url":"https://github.com/baidu-research/DeepBench","last_synced_at":"2025-05-04T18:34:00.149Z","repository":{"id":41045291,"uuid":"69068565","full_name":"baidu-research/DeepBench","owner":"baidu-research","description":"Benchmarking Deep Learning operations on different hardware","archived":false,"fork":false,"pushed_at":"2021-04-25T09:13:47.000Z","size":5516,"stargazers_count":1082,"open_issues_count":23,"forks_count":235,"subscribers_count":109,"default_branch":"master","last_synced_at":"2025-04-09T04:01:52.178Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/baidu-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-09-23T23:57:05.000Z","updated_at":"2025-04-07T22:30:25.000Z","dependencies_parsed_at":"2022-09-20T21:00:49.176Z","dependency_job_id":null,"html_url":"https://github.com/baidu-research/DeepBench","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu-research%2FDeepBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu-research%2FDeepBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu-research%2FDeepBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baidu-research%2FDeepBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/baidu-research","download_url":"https://codeload.github.com/baidu-research/DeepBench/tar.gz/refs/heads/master","host
":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252382838,"owners_count":21739225,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T20:00:46.901Z","updated_at":"2025-05-04T18:33:56.976Z","avatar_url":"https://github.com/baidu-research.png","language":"C++","funding_links":[],"categories":["Benchmark","C++","Example Implementations 💡"],"sub_categories":["Blogs 🖋️"],"readme":"![Baidu Logo](/doc/baidu-research-logo-small.png)\n\n- [DeepBench](#deepbench)\n- [Types of Operations](#types-of-operations)\n- [Training Benchmark](#training-benchmark)\n- [Inference Benchmark](#inference-benchmark)\n- [Supported Ops \u0026 Precision](#supported-ops-and-precision)\n- [Results](#results)\n- [Get Involved](#get-involved)\n- [Getting the Code](#getting-the-code)\n\n\n# DeepBench\n\nThe primary purpose of DeepBench is to benchmark operations that are\nimportant to deep learning on different hardware platforms. Although\nthe fundamental computations behind deep learning are well understood,\nthe way they are used in practice can be surprisingly diverse. For\nexample, a matrix multiplication may be compute-bound,\nbandwidth-bound, or occupancy-bound, based on the size of the matrices\nbeing multiplied and the kernel implementation. 
Because every deep\nlearning model uses these operations with different parameters, the\noptimization space for hardware and software targeting deep learning\nis large and underspecified.\n\nDeepBench attempts to answer the question, \"Which hardware provides\nthe best performance on the basic operations used for deep\nneural networks?\".  We specify these operations at a low level,\nsuitable for use in hardware simulators for groups building new\nprocessors targeted at deep learning. DeepBench includes operations\nand workloads that are important to both training and inference.\n\n## Where does DeepBench fit in? \n\nThe Deep Learning eco system consists of several different pieces. \nWe wanted to highlight where DeepBench fits into this eco system. \nThe diagram below describes the software and hardware components involved with deep learning.\nAt the very top, deep learning frameworks like Baidu's [PaddlePaddle](https://github.com/baidu/Paddle), Theano, \nTensorFlow, Torch etc. All these frameworks allow deep learning researchers to build models. They include basic building \nblocks like layers which can be connected in different ways to create a model. In order to train the deep learning models, \nthe frameworks work with underlying neural network libraries such as NVIDIA's cuDNN and Intel's MKL. \nThese libraries implement operations such as matrix multiply that are important to deep learning models. \nFinally, the models are trained on hardware like NVIDIA GPUs or Intel's Xeon Phi processor.\n\n\u003cimg src=\"doc/deepbench.png\" height=300\u003e\n\nDeepBench uses the neural network libraries to benchmark the performance of basic operations on different hardware.\nIt does not work with deep learning frameworks or deep learning models built for applications. \nWe cannot measure the time required to train an entire model using DeepBench.\nThe performance characteristics of models built for different applications are very different from each other. 
\nTherefore, we are benchmarking the underlying operations involved in a deep learning model. \nBenchmarking these operations will help raise awareness amongst hardware vendors and software developers \nabout the bottlenecks in deep learning training and inference.\n\n## Methodology\n\nDeepBench consists of a set of basic operations (dense matrix\nmultiplies, convolutions and communication) as well as some recurrent\nlayer types.  There are Excel spreadsheets (`DeepBenchKernels_train.xlsx` \u0026 \n`DeepBenchKernels_inference.xlsx`) in this repository that describe all \nof the sizes for training and inference respectively.\n\nFor training, both forward and backward operations are tested. The precision\nrequirements for training and inference are discussed in the sections below.\n\nWe will use vendor supplied libraries even if faster independent\nlibraries exist or faster results have been published. Most users will\ndefault to the vendor supplied libraries and as such the vendor\nsupplied libraries are most representative of users' experience.\n\n## Entry\n\nDeepBench includes training results for seven hardware platforms, NVIDIA's\nTitanX, M40, TitanX Pascal, TitanXp, 1080 Ti, P100 and Intel's Knights\nLanding. Inference results are included for three server platforms, NVIDIA's\nTitanX Pascal, TitanXp and 1080 Ti. Inference results are also included \nfor three mobile devices: iPhone 6, iPhone 7 and Raspberry Pi 3. We provide an overview of the\nresults and all results are available in the `results` folder. We will\ngladly accept pull requests for new hardware platforms.\n\n\n# Types of Operations\n\n## Dense Matrix Multiplies\n\nDense matrix multiplies exist in almost all deep neural networks\ntoday.  
They are used to implement fully connected layers and vanilla\nRNNs and are building blocks for other types of recurrent layers.\nSometimes they are also used as a quick way to implement novel layer\ntypes for which custom code doesn't exist.\n\nWhen performing the GEMM operation `A * B = C`, either or both of `A`\nand `B` can be optionally transposed. Common terminology to describe a matrix problem \nis the triple (M, N, K), which describes the sizes of the matrices involved, \nand the “op” which tells us which matrices (if any) are transposed. The figure below\ndescribes how the triple (M, N, K) corresponds to the sizes of the matrices being multiplied.\n\n\u003cimg src=\"/doc/gemm-diag.png\" width=\"550\" /\u003e\n\nThe variant where both matrices\nare transposed is not used in neural networks.  The other three\nvariants *are* used, but they need not be implemented as a call to\n`SGEMM` with those transpose descriptors.  Sometimes it can be faster\nto perform an in-place transpose followed by the appropriate\nmultiplication and a transpose back.  Such optimizations should be\ndetailed in the spreadsheet.\n\nThe constant coefficients alpha and beta should both be 1.0 so that no\nwork is elided.\n\n## Convolutions\n\nConvolutions make up the vast majority of flops in networks that\noperate on images and videos and form important parts of networks such\nas speech and natural language modeling, thus making them perhaps the\nsingle most important layer from a performance perspective.\n\nConvolutions have 4 or 5 dimensional inputs and outputs giving rise to\na large number of possible orderings for these dimensions.  For the\nfirst version of the benchmark we are only concerned with performance\nin NCHW format i.e.  
data is presented in image, feature maps, rows\nand columns.\n\nThere are many techniques for computing convolutions that are optimal\nfor different sizes of the filter and image, including:  direct, matrix multiply\nbased, FFT based, and Winograd based approaches.  In the first version\nof this benchmark, we are not concerned about the accuracy of the\ndifferent approaches since the general consensus is that 32-bit\nfloating point is accurate *enough* for each of them. We have noted\nthe approach used for each size in the spreadsheet.\n\n## Recurrent Layers\n\nRecurrent layers are usually made up of some combination of the above\noperations and also simpler operations such as unary or binary\noperations which aren't very compute intensive and generally constitute a\nsmall percentage of overall runtime.  However, the GEMM and\nconvolution operations are relatively small in recurrent layers, \nso the cost of these smaller operations can become significant.  This is especially true if there\nis a high fixed overhead associated with starting a computation.  It\nis also possible to use alternate storage formats for the recurrent\nmatrices because the cost of converting to a new storage format can be\namortized over the many steps of the recurrent computation.  If this\nis done, the time to convert to and from the custom format should be\nincluded in the overall time.\n\nThese factors lead to many optimization possibilities both within a\ntime step and across a sequence of time steps such that measuring the\nraw performance of the operations is not necessarily\nrepresentative of the performance of an entire recurrent layer.  
In\nthis benchmark we focus on only one recurrent layer, even though there\nare even more optimization opportunities if one considers stacks of\nthem.\n\nThe calculation of the inputs should not be included in the time for\nthe recurrent layer calculation since it can be calculated as one\nlarge multiply and then consumed by the actual recurrent calculation.\nSo in `h_t = g(W x_t + U h_{t-1})` the time for the calculation of `W x_t` for\nall t should not be included in the time for the recurrent layer.\n\nThe backward calculation should calculate the updates with respect to\nthe weights but not the inputs.  All the recurrent work is done to\ncalculate the weight updates, so calculating the updates with respect\nto the inputs as well just obscures what we are trying to measure.\n\nDeepBench includes support for three types of recurrent cells: \nvanilla RNNs, LSTMs and GRUs. The non-linearity for vanilla RNNs \nshould be a ReLU.  The internal non-linearities of the LSTM should\nbe the standard operations - sigmoid for the gates and tanh for \nthe activations.  The LSTM should not have peephole connections.\nThe internal non-linearity of the GRU should be a sigmoid for the reset and update\ngates. The output gate non-linearity should be a ReLU.\n\n\n## All-Reduce\n\nNeural networks today are often trained across multiple GPUs or even\nmultiple systems, each with multiple GPUs.  There are two main categories of techniques for\ndoing this: synchronous and asynchronous. Synchronous techniques rely\non keeping the parameters on all instances of the model synchronized, usually by making\nsure all instances of the model have the same copy of the gradients before taking an\noptimization step.  The\n[Message Passing Interface (MPI)](https://en.wikipedia.org/wiki/Message_Passing_Interface)\nprimitive usually used to perform this\noperation is called All-Reduce. There are many ways to implement\nAll-Reduce based on the number of ranks, the size of the data, and the\ntopology of the network.  
This benchmark places no constraints on the\nimplementation other than that it should be\ndeterministic. Asynchronous methods are quite varied and in this\nversion of the benchmark we will not be attempting to test these\nmethods.\n\nIn order to evaluate All-Reduce, we use the following libraries and benchmarks:\n* [NVIDIA's NCCL](https://developer.nvidia.com/nccl)\n* [Ohio State University (OSU) Benchmarks](http://mvapich.cse.ohio-state.edu/benchmarks/)\n* [Baidu's Allreduce](https://github.com/baidu-research/baidu-allreduce/)\n* [Intel's MLSL](https://github.com/intel/MLSL)\n\nThe NCCL library can be built without MPI (for single node) and with MPI (for multinode) as shown in https://github.com/NVIDIA/nccl-tests. \nWe therefore have two versions of NCCL for the single node in the experiments. For multinode experiments,\nwe use only NCCL with MPI, the benchmark from OSU, and Baidu's Allreduce implementation. \nWe report the shortest latency achieved from all implementations for each configuration.\n\nIntel(R) Machine Learning Scaling Library (Intel(R) MLSL) is a library\nproviding an efficient implementation of communication patterns used in deep learning.\nIn order to evaluate All-Reduce performance, we use the All-Reduce benchmark from OSU.\n\n#### Topology for NVIDIA 8 GPU System\nEach node has two CPU sockets (dual root topology), and each socket has a PCIe root complex.  For each socket there are two PLX switches that are each connected to the CPU socket via 16 lanes of PCIe v3.  There are two GPUs on each PLX switch. All pairs of GPUs communicate simultaneously over 16 lanes of PCIe v3. The two CPU sockets are connected via Intel QPI. The interconnect across nodes is InfiniBand FDR. The figure below shows a schematic diagram of one of our nodes, where all devices connected by the same PCI\nroot complex are encapsulated in a dotted box. 
In our experiments, P100, TitanX Maxwell and M40 were such systems.\n\n![Topology of NVIDIA GPU system with 8 GPUs](/doc/topology-8gpu-system.png)\n\n#### Topology for NVIDIA 10 GPU System\nEach node has one CPU socket (single root topology) with two PLX switches, each of which is connected to 5 GPUs. Communication among GPUs on the same PLX switch traverses that PLX switch only, whereas \ncommunication with any GPU connected to the other PLX switch requires traversing both PLX switches along with the connecting PCIe bridge. In our experiments, TitanX Pascal and 1080Ti were such systems.\n\n#### Topology for Intel Xeon Phi and Omni-Path System\nThe blocking All-Reduce latency is measured on Intel Xeon Phi processor 7250 on Intel’s internal Endeavor cluster\nwith Intel® Omni-Path Architecture (Intel® OPA) series 100 fabric with fat-tree topology, using Intel MPI 2017 Update 3 and Intel MLSL 2017 Update 2 Preview.\n\n# Training Benchmark\n\nThe training benchmark includes support for all the operations discussed\nabove. The `DeepBenchKernels_train.xlsx` file contains the entire list of\nkernels for the training benchmark.\n\n## Training Precision\n\n\nWhile training deep learning models, most researchers typically use \nsingle precision floating point numbers for all compute kernels. \nAcademic research has demonstrated that reduced precision training works \nfor several different models trained on limited datasets. In our experience, \nwe’ve found that 16 bit half precision floating point numbers are \nsufficient to train large deep learning models on large datasets reliably. \nTraining with half precision numbers allows hardware vendors to better \nutilize the available computing power. In addition, the parameters require \nhalf the total storage for the entire model.\n\n\nDeepBench specifies the minimum precision requirements for training. We are specifying \nprecision for multiply and add for all the operations. 
**The minimum precision \nfor multiplication and addition is set to 16 and 32 bits respectively.** \nNone of the currently available hardware supports 16 bit multiply and 32 bit accumulate. \nWe will accept results on any hardware platform that satisfies this minimum precision \nrequirement. All results will include the precision that is used for the benchmark.\n\n# Inference Benchmark\n\nBenchmarking inference is a very challenging problem. There are many applications \nthat have been enabled by deep learning and each of them has its own unique \nperformance characteristics and requirements. We selected applications for benchmarking\nthat receive high user traffic. We are also including kernels from deep learning models\nthat are used across several different applications.\n\nFor the inference kernels, we cover the same set of operations as the training set i.e. \nmatrix multiply, convolution and recurrent operations. The kernels have some differences \nfrom the training counterparts. In the next few sections, we discuss the changes needed \nto benchmark inference workloads. The `DeepBenchKernels_inference.xlsx` file contains\nthe complete list of kernels for the inference benchmark.\n\n\n## Deployment Platform\n\nLarge scale real world applications such as image search, language translation and \nspeech recognition are typically deployed on servers located in data centers. The client \nsends the request over the internet which is processed on the remote server hosting the \ndeep learning model. The remote server is typically a powerful machine consisting of many \nprocessors. The memory and compute capabilities are large enough to host very large deep \nlearning models. The downside of deploying the model on the server is that the latency depends \non the network bandwidth between the client and the server. It also requires the user to \nbe connected to the internet. In order to address these issues, several models are being \ndeployed on end devices.  
On-device deployment enables deep learning models to have lower \nlatency and to be always available regardless of internet connectivity. However, these models \nneed to be smaller in order to fit within the power and memory constraints of mobile and \nembedded devices.\n\nIn DeepBench, we measure the performance of inference kernels on both server and mobile \nplatforms. Hardware vendors or users can \nrun the appropriate benchmarks and add their results to the repository. We provide an overview \nof the results below and detailed results are available in the `results/inference` folder. \nWe will gladly accept pull requests for new hardware platforms.\n\n## Inference Batch Size\n\nIn order to meet latency requirements of user requests, most internet applications process \nrequests individually as they arrive at the data center. This makes for a straightforward \napplication where each request is handled by a single thread. However, this is inefficient \nfor two reasons. Firstly, processing requests individually makes the operation bandwidth bound as \nthe processor needs to load weights of the network. This makes it harder for the processor to \neffectively utilize the on chip caches. Secondly, the amount of parallelism that can be \nexploited to classify one request is limited, making it difficult to exploit SIMD or multicore \nparallelism. RNNs are especially challenging to deploy because evaluating RNNs sample by sample \nrelies on matrix vector multiplications, which are bandwidth bound and difficult to parallelize.\n\nTo overcome these issues, we built a batching scheduler called Batch Dispatch which assembles \nstreams of data from user requests into batches before performing forward propagation on these \nbatches. In this case, there is a tradeoff between increased batch size, and consequently \nimproved efficiency, and increased latency. The more we buffer user requests to assemble a large batch, \nthe longer users must wait for their results. 
This places constraints on the amount of batching we can perform.\n\nIn practice, we’ve seen that batching requests up to 4 or 5 seems to work well for efficiency and \nlatency for data center deployment. In the case of deployment on devices, the batch size is limited to 1.\n\n## Inference Precision\n\nDeep neural networks are trained using single precision or half precision floating point numbers. \nThe precision requirements for inference are significantly lower than training. Several different \nmodels can be deployed with 8 bit representations for inference with little or no loss in accuracy \ncompared to their floating point models. **Therefore, for inference kernels, we’re specifying the \nminimum precision for multiplication and accumulation of 8 and 32 bits respectively.** Since \nall hardware platforms may not support this precision requirement, we will accept results for any \nplatform that satisfies this minimum requirement. All results will include the precision used for the benchmark.\n\nTo benchmark matrix multiplication with 8 bit inputs for ARM processors, \nwe use the Gemmlowp library. Convolution kernels from the ARM Compute Library are used for the convolution benchmark. \nThe ARM Compute Library only supports single precision convolutions. Low precision convolution \nsupport should be available shortly. The ARM Compute Library doesn’t have any support for RNNs. \nTherefore, DeepBench does not include RNN results for ARM devices. We welcome contributions from other \nlibraries that support RNN operations for ARM devices.\n\nFor server deployment, we use the cuDNN library and cuBLAS library for Nvidia GPUs. For Nvidia GPUs, \nRNN kernels only support single precision and results are reported with the same. More details regarding \nwhich ops are supported on different processors can be found in later sections.\n\n## Sparse Operations\n\nA sparse neural network is one where most of the weights of the neural network are zero. 
\nThese zero weights don’t contribute in determining the prediction of the neural network. Sparse neural \nnetworks reduce memory and computation footprint which enables deep learning models to be deployed on \nmobile devices. Inference performance of RNNs is dominated by the memory bandwidth of the hardware, \nsince most of the work is simply reading in the parameters at every time step. Moving from a dense \ncalculation to a sparse one comes with a penalty, but if the sparsity factor is large enough, then \nthe smaller amount of data required by the sparse routines becomes a win.\n\n\u003cimg src=\"/doc/SparseNN.png\" width=\"550\" /\u003e\n\nThe more powerful server class processors used in data centers can generally perform inference quickly \nenough to serve one user, but in the data center performance per dollar is very important. Techniques \nsuch as sparsity that allow models to be evaluated faster enable more users to be served per GPU \nincreasing the effective performance per dollar.\n\nThere has been a lot of progress in developing sparse neural networks in the past couple of years. DeepBench \nincludes sparse matrix vector and sparse matrix multiply kernels. Based on our research, we’ve learnt \nthat neural networks with 90 to 95% sparsity can achieve relatively good performance compared to their \ndense baselines. However, current implementations of sparse matrix multiply are optimized for much higher \nsparsity (around 99% or higher). By including sparse kernels, we’re hoping to incentivize hardware vendors \nand software developers to build libraries that provide good performance for sparsity in the range of 90~95%.\n\nWe use the Eigen library to benchmark sparse operations on ARM devices. For GPU benchmarks, we use the cuSparse \nlibrary from Nvidia.\n\n## Measuring Latency\n\nMany inference applications have real time latency requirements. 
For example, speech interfaces \nrequire speech recognition models to return a result without a delay that is noticeable to a user. \nDeepBench kernels can be used as a starting point to measure the best case latency of individual \noperations. However, measuring full system latency is outside the scope of this release of DeepBench, \ngiven the focus on basic operations rather than complete applications. For example, a complete \napplication running on a mobile device might need to modify the power state of the system when \nstarting up. In another example, a complete server application might have a significant latency \ncomponent that is determined by a user’s network connection to the server. We may consider \naddressing operation latency in a future version of DeepBench.\n\n# Supported Ops and Precision\nIn this section, we document the support for the various operations across precisions for different processors.\nAs far as possible, we pick the precision that closely matches the minimum required precision. The precision \nrequirements are stated below again. However, there are cases where we need to benchmark higher precision operations. \nThe tables below highlight which operations are benchmarked for each processor.\n\n**Minimum Precision for training**: 16 bit multiply, 32 bit accumulate\n\n**Minimum Precision for inference**: 8 bit multiply, 32 bit accumulate\n\n## Training\n\nSingle precision results are available for 5 Nvidia GPUs and Intel's Xeon Phi processor. None of the available\nprocessors support 16 bit multiplication and 32 bit addition. Instead, we benchmark Nvidia's Pseudo FP16 mode\nwhere inputs/outputs are 16 bit but the compute is still in single precision. 
Support for mixed precision training\nis available in upcoming hardware processors.\n\n| Processor               | Single precision   | FP16 inputs/FP32 math   | FP16 inputs / Mixed Precision Math |\n| ----------------------- | ------------------ | ----------------------- | ---------------------------------- |\n| Nvidia TitanX Maxwell   | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia Tesla M40        | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia 1080Ti           | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia TitanX Pascal    | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia TitanXp          | GEMM, Conv, RNN    |                         |                                    |\n| Nvidia Tesla P100       | GEMM, Conv, RNN    | GEMM, Conv, RNN         |                                    |\n| Nvidia Tesla V100       | GEMM, Conv, RNN    |                         | GEMM, Conv, RNN                    |\n| Intel Xeon Phi 7250     | GEMM, Conv         |                         |                                    |\n\n\n## Server Deployment\n\nThe GEMM and convolution benchmark are run with 8 bit multiplication and 32 bit accumulate on \nNVIDIA processors. However, NVIDIA GPUs don't support all input sizes for this precision mode.\nInput sizes have to be a multiple of 4 to run in this precision mode. We have padded inputs dimensions \nto be multiples of 4 for all kernels. The cost of padding and discarding extra outputs is small\ncompared to the cost of the operation. The results spreadsheet indicates which of the kernels required\npadding. 
Sparse operations and Recurrent kernel results are reported in single precision since \nthe relevant libraries don't support low precision.\n\n| Processor                  | Single Precision             | Int8 multiply/32 bit accumulate | \n|-----------------------|------------------|-----------------------|\n| Nvidia 1080Ti              | RNN, Sparse GEMM | GEMM, Conv                                 |\n| Nvidia TitanX Pascal       | RNN, Sparse GEMM | GEMM, Conv                                 |\n| Nvidia TitanXp             | RNN, Sparse GEMM | GEMM, Conv                                 |\n\n## Device Deployment\n\nThe table below describes the inference device kernel results available on different processors, ops and \nprecision. We don't have any results for RNNs since no ARM libraries support RNNs. ARM Compute library\nis not yet supported on the iPhone.\n\n| Processor                  | Single Precision             | Int8 inputs/32 bit math | \n|-----------------------|------------------|-----------------------|\n| Raspberry Pi 3        | Conv                         | GEMM, Sparse GEMM               |\n| iPhone 6                   |                              | GEMM, Sparse GEMM               |\n| iPhone 7                   |                              | GEMM, Sparse GEMM               |\n\n\n# Results\nIn this section, we are documenting the performance for a few operations. \nThese are picked at random and are only meant to demonstrate the performance for a few applications.\n__The results below only include the time and TeraFLOPS for the fastest processor for the particular operation and parameters. The full results can be found in the `results` folder__. \n\nThe precision used for benchmarking the training and inference processors is listed at the top of the results file. 
\n\nTraining results can be found in the `results/training` folder which contains the following files:\n\n* `DeepBench_IA_KNL7250.xlsx`: Training results on Intel's Xeon Phi Processor\n* `DeepBench_NV_TitanX.xlsx`: Training results on NVIDIA's TitanX GPUs\n* `DeepBench_NV_M40.xlsx`: Training results on NVIDIA's M40 GPUs\n* `DeepBench_NV_TitanX_Pascal.xlsx`: Training results on NVIDIA's TitanX Pascal GPU\n* `DeepBench_NV_TitanXp.xlsx`: Training results on NVIDIA's TitanXp Pascal GPU\n* `DeepBench_NV_1080Ti.xlxs`: Training results on NVIDIA's 1080 Ti GPU\n* `DeepBench_NV_P100.xlsx`: Training results on NVIDIA's P100 GPU\n* `DeepBench_NV_V100.xlsx`: Training results on NVIDIA's V100 GPU\n\nDetailed inference results can be found in the `results/inference` folder which contains the following files:\n* `server/DeepBench_NV_TitanXp.xlsx`: Inference results on NVIDIA's TitanXp GPUs\n* `server/DeepBench_NV_TitanXp.xlsx`: Inference results on NVIDIA's TitanXp Pascal GPU\n* `server/DeepBench_NV_1080Ti.xlxs`: Inference results on NVIDIA's 1080 Ti GPU\n* `device/DeepBench_iPhone_7.xlsx` : Inference results on iPhone 7\n* `device/DeepBench_iPhone_6.xlsx` : Inference results on iPhone 6\n* `device/DeepBench_Raspberry_Pi_3.xlsx` : Inference results on Raspberry Pi 3\n\nThe software libraries (e.g. cuDNN, OpenMPI) used to benchmark performance are mentioned in each of Excel workbooks in `Specs` sheet.\nPlease feel free to ask us any clarifying questions.\n\nResults on more hardware platforms will be added once they are available. 
We welcome contributions from all hardware vendors.\n\n## Training Results\n\n### GEMM Results\n\n| Kernel                 | A Transpose | B Transpose | Application        | Time (ms) | TeraFLOPS | Processor     |\n|------------------------|-------------|-------------|--------------------|--------------|-----------|---------------|\n| M=1760, N=128, K=1760  | N           | N           | Speech Recognition | 0.07         | 10.72      | Tesla V100 Mixed Precision |\n| M=7860, N=64, K=2560   | N           | N           | Speech Recognition | 0.10         | 25.94      | Tesla V100 Mixed Precision |\n| M=2560, N=64, K=2560   | N           | N           | Speech Recognition | 0.08         | 10.11      | Tesla V100 Mixed Precision |\n| M=5124, N=9124, K=2560 | T           | N           | Speech Recognition | 8.73         | 27.43      | Tesla V100 Mixed Precision |\n| M=3072, N=128, K=1024  | T           | N           | Speech Recognition | 0.04         | 18.73      | Tesla V100 Mixed Precision |\n\n### Convolution Results\n\n| Input Size                        | Filter Size     | # of Filters   | Padding (h, w)   | Stride (h, w)   | Application          | Total Time (ms)   | Fwd TeraFLOPS   | Processor       |\n| --------------------------------- | --------------- | -------------- | ---------------- | --------------- | -------------------- | ----------------- | --------------- | --------------- |\n| W = 700, H = 161, C = 1, N = 32   | R = 5, S = 20   | 32             | 0, 0             | 2, 2            | Speech Recognition   | 1.53              | 7.75            | Tesla V100 FP32 |\n| W = 54, H = 54, C = 64, N = 8     | R = 3, S = 3    | 64             | 1, 1             | 1, 1            | Face Recognition     | 0.55              | 10.12           | Tesla V100 FP32 |\n| W = 224, H = 224, C = 3, N = 16   | R = 3, S = 3    | 64             | 1, 1             | 1, 1            | Computer Vision      | 2.40              | 1.40            | Tesla V100 FP32 |\n| W = 7, H = 7, 
 C = 512, N = 16    | R = 3, S = 3    | 512            | 1, 1             | 1, 1            | Computer Vision      | 0.70              | 14.56           | Tesla V100 Mixed Precision |\n| W = 28, H = 28, C = 192, N = 16   | R = 5, S = 5    | 32             | 2, 2             | 1, 1            | Computer Vision      | 0.93              | 16.90           | Tesla V100 FP32  |\n\n### Recurrent Ops Results\n\nThe recurrent op kernels are only run on NVIDIA hardware.\n\n| Hidden Units   | Batch Size   | TimeSteps   | Recurrent Type   | Application           | Total Time (ms) | Fwd TeraFLOPS   | Processor       |\n| -------------- | ------------ | ----------- | ---------------- | --------------------- | ------------    | --------------- | --------------- |\n| 1760           | 16           | 50          | Vanilla          | Speech Recognition    | 8.21            | 1.19            | Tesla V100 Mixed Precision |\n| 2560           | 32           | 50          | Vanilla          | Speech Recognition    | 10.50           | 4.08            | Tesla V100 Mixed Precision |\n| 1024           | 128          | 25          | LSTM             | Machine Translation   | 5.56            | 10.91           | Tesla V100 Mixed Precision |\n| 2816           | 32           | 1500        | GRU              | Speech Recognition    | 380.04          | 11.85           | Tesla V100 Mixed Precision |\n\n### All-Reduce Results\n\n| Size (# of floats) | Number of Processors | Application        | Time (ms)   | Bandwidth (GB/s) | Processor                           |\n|--------------------|----------------------|--------------------|-------------|------------------|-------------------------------------|\n| 16777216           | 8                    | Speech Recognition | 8.66        | 61.99            | Xeon Phi 7250 with Intel® Omni-Path |\n| 16777216           | 16                   | Speech Recognition | 14.72       | 72.94            | Xeon Phi 7250 with Intel® Omni-Path |\n| 16777216           | 32   
                | Speech Recognition | 19          | 113.03           | Xeon Phi 7250 with Intel® Omni-Path |\n| 64500000           | 32                   | Speech Recognition | 76.68       | 107.67           | Xeon Phi 7250 with Intel® Omni-Path |\n\n## Inference Server Results\n\nThe next few sections provide a few results for GEMM, Convolution and Recurrent operations for inference kernels on\nserver platforms. Results on Intel platforms should be available shortly.\n\n### GEMM Results\n\n| Kernel                 | Application        | Results (ms) | TeraFLOPS | Processor |\n|------------------------|--------------------|--------------|-----------|-----------|\n| M=5124, N=700, K=2048  | Speech Recognition | 0.46         | 31.94     | 1080 Ti   |\n| M=35, N=700, K=2048    | Speech Recognition | 0.05         | 2.09      | 1080 Ti   |\n| M=3072, N=3000, K=1024 | Speech Recognition | 0.49         | 38.36     | Titan Xp  |\n| M=512, N=6000, K=2816  | Speech Recognition | 0.43         | 40.71     | Titan Xp  |\n\n### Sparse GEMM Results\n\n| Kernel                 | Sparsity | Application        | Results (ms) | Speedup wrt dense | TeraFLOPS | Processor |\n|------------------------|----------|--------------------|--------------|-------------------|-----------|-----------|\n| M=7680, N=1, K=2560    | 0.95     |Speech Recognition  | 0.03         | 6.56              | 1.10      | 1080 Ti   |\n| M=7680, N=2, K=2560    | 0.95     |Speech Recognition  | 0.04         | 5.93              | 1.74      | 1080 Ti   |\n| M=7680, N=1500, K=2560 | 0.95     |Speech Recognition  | 29.81        | 0.16              | 1.88      | TitanXp   |\n| M=10752, N=1, K=3584   | 0.9      | Speech Recognition | 0.1          | 4                 | 0.72      | TitanXp   |\n\n### Convolution Results\n\n| Input Size                     | Filter Size   | # of Filters | Padding (h, w) | Stride (h, w) | Application        | Time (ms) | TeraFLOPS | Processor     
|\n|--------------------------------|---------------|--------------|----------------|---------------|--------------------|-----------|-----------|---------------|\n| W = 341, H = 79, C = 32, N = 4 | R = 5, S = 10 | 32           | 0,0            | 2,2           | Speech Recognition | 0.29      | 9.03      | TitanXp       |\n| W = 224, H = 224, C = 3, N = 1 | R = 7, S = 7  | 64           | 3, 3           | 2, 2          | Computer Vision    | 0.14      | 1.64      | TitanXp       |\n| W = 56, H = 56, C = 256, N = 1 | R = 1, S = 1  | 128          | 0, 0           | 2, 2          | Computer Vision    | 0.015     | 3.43      | TitanX Pascal |\n| W = 7, H = 7,  C = 512, N = 2  | R = 1, S = 1  | 2048         | 0, 0           | 1, 1          | Computer Vision    | 0.018     | 11.42     | 1080 Ti       |\n\n### RNN Results\n\n| Hidden Units | Batch Size | TimeSteps | Recurrent Type | Application                  | Results (ms) | Fwd TeraFLOPS | Processor |\n|--------------|------------|-----------|----------------|------------------------------|------------|---------------|-----------|\n| 1536         | 4          | 50        | LSTM           | Language Modelling           |   6.93      |  0.55             |  TitanXp         |\n| 256          | 4          | 150       | LSTM           | Character Language Modelling |   1.63         |  0.19             |   1080 Ti        |\n| 2816         | 1          | 1500      | GRU            | Speech Recognition           |   350.62         |  0.20             | TitanXp          |\n| 2560    |   2         |   375        |  GRU              | Speech Recognition       |  75.02          |     0.39 | TitanXp          |\n\n## Inference Device Results\n\n### GEMM Results\n\n| Kernel                 |  Application        | Results (ms) | GigaFLOPS | Processor     |\n|------------------------|--------------------|--------------|-----------|---------------|\n| M=5124, N=700, K=2048  | Speech Recognition | 212.84         | 69.03      | iPhone 7 
|\n| M=35, N=700, K=2048   | Speech Recognition | 1.94         | 51.69      | iPhone 7 |\n| M=3072, N=1500, K=1024 | Speech Recognition | 136.63         | 69.07      | iPhone 7 |\n\n### Sparse GEMM Results\n\n| Kernel                 | Sparsity | Application        | Results (ms) | Speedup wrt dense | GigaFLOPS | Processor |\n|------------------------|----------|--------------------|--------------|-------------------|-----------|-----------|\n| M=7680, N=1, K=2560    | 0.95     |Speech Recognition  | 1.01         | 15.55             | 18.55     | iPhone 7  |\n| M=7680, N=1500, K=2560 | 0.95     |Speech Recognition  | 1677.36      | 5.46              | 16.70     | iPhone 7  |\n| M=7680, N=1, K=2560    | 0.9      | Speech Recognition | 2.1          | 8.02              | 8.41      | iPhone 7  |\n\n\n### Convolution Results\n\n| Input Size                      | Filter Size  | # of Filters | Padding (h, w) | Stride (h, w) | Application     | Time (ms) | GigaFLOPS | Processor      |\n|---------------------------------|--------------|--------------|----------------|---------------|-----------------|-----------|-----------|----------------|\n| W = 112, H = 112, C = 64, N = 1 | R = 1, S = 1 | 64           | 0, 0           | 1, 1          | Computer Vision | 670.75    | 0.15      | Raspberry Pi 3 |\n| W = 56, H = 56, C = 256, N = 1  | R = 1, S = 1 | 128          | 0, 0           | 2, 2          | Computer Vision | 185.87    | 0.28      | Raspberry Pi 3 |\n| W = 7, H = 7,  C = 512, N = 1   | R = 1, S = 1 | 2048         | 0, 0           | 1, 1          | Computer Vision | 735.28    | 0.14      | Raspberry Pi 3 |\n\n# Get Involved\n\nWe welcome contributions from the community to DeepBench. You can contribute in two ways:\n\n1. Deep Learning Researchers/Engineers: If you are a deep learning researcher or engineer working on a new deep learning application, you may have different operations and/or workloads involved in training your model. 
We are interested in learning more about the underlying operations that are adversely impacting the performance (speed) of your model. Please contribute these operations and workloads!\n2. Hardware Vendors: We would gladly accept contributions from other hardware vendors. We're open to accepting benchmark results from large companies or smaller startups building hardware for training deep learning models. Please contribute benchmark results for your hardware!\n\n# Getting the Code\nTo get the code, clone the GitHub repository:\n\n```\ngit clone https://github.com/baidu-research/DeepBench\n```\n\n# NVIDIA Benchmarks\n## Compiling\n\nIn order to build the benchmarks, you will need to specify the following paths:\n```\nMPI_PATH: Path to MPI library. The benchmarks have been tested with OpenMPI version 1.10.2.\nCUDA_PATH: Path to CUDA library. The benchmarks have been tested with version 7.5.18.\nCUDNN_PATH: Path to CUDNN library. The benchmarks have been tested with version 5.0.\nNCCL_PATH: Path to NCCL library. NCCL library is available at https://github.com/NVIDIA/nccl. The benchmarks have been tested with commit b3a9e1333d9e2e1b8553b5843ba1ba4f7c79739d\n```\n\nTo build all the benchmarks, please use the following command:\n```\ncd code/\nmake CUDA_PATH=\u003ccuda_path\u003e CUDNN_PATH=\u003ccudnn_path\u003e MPI_PATH=\u003cmpi_path\u003e NCCL_PATH=\u003cnccl_path\u003e\n```\n\nFor distributions that split their MPI headers and libraries (e.g. RHEL, Fedora, CentOS) into separate directories, you should also specify the path to the include files:\n\n```\nMPI_INCLUDE_PATH=\u003cmpi_include_path\u003e\n```\n\nYou need to build the code for the appropriate architecture. By default, the architecture version is set to 5.2. This works for the TitanX and Tesla M40 GPUs. 
In order to build the benchmark for another architecture (such as Pascal with version 6.1), please append the following variable to the `make` command:\n\n```\nARCH=sm_61 ## Just an example for Pascal architecture\n```\n\nIn some cases, it may be useful to generate benchmarking executables for multiple architectures. For example, some systems may have multiple graphics processors with different architectures installed. The NVIDIA compiler (nvcc) supports the generation of \"fat binaries\" that contain intermediate and compiled code for multiple target architectures. To compile for multiple architectures, add a comma-separated list of architectures to the `make` command line.\n\n```\nARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62,sm_70     # Everything since Kepler!\n```\nNote that compilation for multiple architectures will take longer than compilation for a single architecture. Also, not all CUDA versions support all architectures. For example, support for sm_60 (and later) requires CUDA 8 or later.\n\n\nFor inference problems with `int8` precision, the convolution and GEMM kernels need to be padded to be multiples of 4. By default, the kernels are padded and results are reported with padding. To disable padding, please use the following build option. When padding is disabled, the benchmark numbers aren't reported for the kernels that aren't supported. \n\n```\nmake gemm PAD_KERNELS=0\nmake conv PAD_KERNELS=0\n```\n\nIn order to use Tensor Cores on NVIDIA's V100 processor, you need to use CUDA 9.0 and cuDNN 7.0 or higher. With the correct libraries installed, add the following option to the `make` command:\n\n```\nmake USE_TENSOR_CORES=1 ARCH=sm_70\n```\nConvolution operations running on Tensor Cores need input and output channels to be a multiple of 8. 
The benchmarks currently pad the input channels to be a multiple of 8 and report padded numbers.\n\n## Running the Benchmarks\n\nOnce compilation completes successfully, the executables will be\ngenerated in the `bin` directory. Before executing the benchmarks, it\nis important to set your `LD_LIBRARY_PATH` correctly. For bash shells,\nplease use:\n\n```\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003ccuda_path\u003e:\u003ccudnn_path\u003e:\u003cmpi_path\u003e:\u003cnccl_path\u003e\n```\n\n\nThe GEMM, convolution, recurrent op and sparse GEMM benchmarks can be run by calling\nthe respective executables. Here is some of the output from the GEMM \nbenchmark:\n\n```\n~/DeepBench/code$ bin/gemm_bench\n                  Running training benchmark \n                         Times\n----------------------------------------------------------------------------------------\n    m       n      k      a_t     b_t      precision  time (usec) \n   1760     16   1760      0      0        float          180\n   1760     32   1760      0      0        float          182\n   1760     64   1760      0      0        float          247\n   1760    128   1760      0      0        float          318\n```\n\nBy default, the benchmarks are run with training problems. The default \nprecision for benchmarking is determined based on the CUDA and cudnn \nlibrary versions. The mode (inference or training) and precision can be specified on the command line \nusing: \n\n```\nbin/gemm_bench \u003cinference|train\u003e \u003cint8|float|half\u003e\n```\n\nEach of the benchmark files includes a note indicating which precision is\nsupported for different GPUs. \n\nTo execute the NCCL single All-Reduce benchmark, you need to specify\nthe number of GPUs as an argument. 
Please note that the number of GPUs\nmust not be greater than the number of GPUs visible in your system.\n\n```\nbin/nccl_single_all_reduce \u003cnum_gpus\u003e\n```\n\nThe NCCL MPI All-Reduce benchmark can be run using `mpirun` as shown below:\n\n```\nmpirun -np \u003cnum_ranks\u003e bin/nccl_mpi_all_reduce\n```\n`num_ranks` cannot be greater than the number of GPUs in the system.\n\nThe `osu_allreduce` benchmark can be executed using `mpirun` as follows:\n```\nmpirun -np \u003cnum_processes\u003e bin/osu_allreduce\n```\n\nThe `osu_allreduce` benchmark can be run with more processes than\nGPUs. However, all our experiments were conducted with each process\nrunning on a single GPU.\n\n# Baidu Benchmarks\n## Compiling\n\nIn order to build the benchmarks, you will need to specify the following paths:\n```\nMPI_PATH: Path to MPI library. The benchmarks have been tested with OpenMPI version 2.0.1.\nCUDA_PATH: Path to CUDA library. The benchmarks have been tested with version 8.0.61.\nBAIDU_ALLREDUCE_PATH: Path to Baidu's allreduce implementation, which is available at https://github.com/baidu-research/baidu-allreduce/.\n```\n\nTo build all the benchmarks, please use the following command:\n```\ncd code/\nmake CUDA_PATH=\u003ccuda_path\u003e MPI_PATH=\u003cmpi_path\u003e BAIDU_ALLREDUCE_PATH=\u003cbaidu_allreduce_path\u003e\n```\n\nFor distributions that split their MPI headers and libraries (e.g. RHEL, Fedora, CentOS) into separate directories, you should also specify the path to the include files:\n\n```\nMPI_INCLUDE_PATH=\u003cmpi_include_path\u003e\n```\n\nPlease set the `ARCH` parameter for the appropriate architecture, as discussed above in the NVIDIA Benchmarks section.\n\n## Running the Benchmarks\n\nOnce compilation completes successfully, the executables will be\ngenerated in the `bin` directory. Before executing the benchmarks, it\nis important to set your `LD_LIBRARY_PATH` correctly. 
For bash shells,\nplease use:\n\n```\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003ccuda_path\u003e:\u003cmpi_path\u003e:\u003cbaidu_allreduce_path\u003e\n```\n\nThe Baidu All-Reduce benchmark can be run using `mpirun` as shown below:\n\n```\nmpirun -np \u003cnum_ranks\u003e bin/ring_all_reduce\n```\n`num_ranks` is used as the total number of GPUs in the system.\n\n# Intel Benchmarks\n## Compiling and Running the Benchmarks\n\nSource all the Intel tools (icc, mkl, mpi) into the path:\n\n```\nsource \u003cicc_installdir\u003e/bin/compilervars.sh intel64\nsource \u003cmkl_installdir\u003e/bin/mklvars.sh intel64\nsource \u003cimpi_installdir\u003e/bin/mpivars.sh intel64\nsource \u003cmlsl_installdir\u003e/intel64/bin/mlslvars.sh\n```\n\nRunning the Intel GEMM benchmark (MKL 2017):\n\n```\ncode/intel/sgemm/run_mkl_sgemm_ia.sh\n```\n\nRunning the Intel convolution benchmark (MKL 2017 and libxsmm (open\nsource KNL optimized convolution implementation)):\n\n```\ncode/intel/convolution/run_conv_ia.sh\n```\n\nThe Intel All-Reduce benchmarks use the standard OSU benchmark compiled/running with Intel MPI or with Intel MLSL.\n\nIn order to build the Intel All-Reduce benchmarks, you will need to specify the following paths:\n```\nMPI_PATH: Path to Intel MPI library ($I_MPI_ROOT by default). The benchmarks have been tested with Intel MPI 2017 Update 3.\nMLSL_PATH: Path to Intel MLSL library ($MLSL_ROOT by default). 
The benchmarks have been tested with Intel MLSL 2017 Update 2 Preview.\n```\nand use the `Makefile_ia` makefile.\n\nFor example (building with default paths):\n```\nmake -f Makefile_ia all\n```\n\nRunning the Intel All-Reduce benchmarks:\n```\ncode/osu_allreduce/run_allreduce_ia.sh \u003chostfile\u003e \u003callreduce_binary\u003e\n```\n\nThere are 2 possible values for `\u003callreduce_binary\u003e`:\n* osu_allreduce - benchmark for blocking All-Reduce over MPI\n* mlsl_osu_allreduce - benchmark for blocking All-Reduce over MLSL\n\nThe performance of blocking All-Reduce over MLSL is reported in DeepBench result files.\n\nFor example, to run the All-Reduce benchmark over MLSL, create a hostfile with one hostname per line\nand run the script as follows:\n```\ncode/osu_allreduce/run_allreduce_ia.sh \u003chostfile\u003e mlsl_osu_allreduce\n```\nThe script runs the benchmark at different scales (2, 4, 8, 16, 32 nodes) and on DeepBench-specific message sizes.\nThe benchmark reports the average latency.\n\nFor example, benchmark output on 32 KNL/OPA nodes:\n```\n# Size         Avg Latency(ms)\n100000                    0.31\n3097600                   3.59\n4194304                   4.67\n6553600                   7.17\n16777217                 16.80\n38360000                 56.65\n64500000                 75.77\n```\n\n# ARM Benchmarks\n\nThe ARM benchmarks in DeepBench are compiled and run on 64-bit ARM v8 processors. \nThe `Makefile` in the `code/arm` folder only supports this processor. In order to benchmark\nother processors, you will have to modify the `Makefile` to support them. \n\n## GEMM Benchmark\n\nThe ARM GEMM benchmark uses the [Gemmlowp](https://github.com/google/gemmlowp) library\nfor `int8` kernels. This library is included as a submodule in the DeepBench repository.
\nTo build and run the benchmark, simply run:\n```\n./run_gemm_bench.sh\n```\n\n## Convolution Benchmark\nThe ARM Convolution benchmark uses the [ARM Compute Library](https://github.com/ARM-software/ComputeLibrary).\nTo build the benchmark, you need to specify the include and lib paths for ARM compute library:\n```\nARM_COMPUTE_INCLUDE_PATH: Path to ARM Compute Library \nARM_COMPUTE_LIB_PATH: Path to ARM Compute library binary\n```\nTo build and run the benchmark, please use:\n```\nmake conv ARM_COMPUTE_INCLUDE_PATH=\u003cpath\u003e ARM_COMPUTE_LIB_PATH=\u003cpath\u003e\nexport LD_LIBRARY_PATH=$LD_LIBRARY_PATH:\u003cpath to arm compute library binary\u003e\nbin/conv_bench\n```\n\n## Sparse GEMM Benchmark\nThe Sparse GEMM Benchmark uses the [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page) library. \nTo build the benchmark, you need to download the eigen library and specify the path:\n```\nEIGEN_PATH: path to Eigen library\n```\n\nTo compile and run the benchmark, please use the following command:\n```\nmake sparse EIGEN_PATH=\u003cpath\u003e\nbin/sparse_bench\n```\n\n# AMD Benchmarks\n\n## Prerequisites\n* A ROCm enabled platform, more info [here](https://rocm.github.io/install.html).\n* [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) - HIP backend of MIOpen is required.\n* [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS)\n\nAt present only `fp32 train` benchmarks are enabled.\n\n## Compiling\n\nThe `Makefile` in `code/amd` is for an AMD `gfx900` GPU. 
To benchmark other generations, please modify the `Makefile` accordingly.\n\nSet your environment variables before compiling or running:\n\n```\nexport PATH=PATH_TO_ROCM/bin:$PATH\nexport CPATH=PATH_TO_MIOPEN/include:$CPATH\nexport LIBRARY_PATH=PATH_TO_MIOPEN/lib:$LIBRARY_PATH\nexport LD_LIBRARY_PATH=PATH_TO_MIOPEN/lib:PATH_TO_MIOPENGEMM/lib:$LD_LIBRARY_PATH\n```\n\nTo compile the convolution, RNN, and GEMM benchmarks, run:\n\n```\nmake conv rnn gemm\n```\n\n## Running the Benchmarks\nAfter successful compilation, the executables will be generated in the `bin` directory.\n\nTo benchmark convolutions:\n```\nbin/conv_bench\n```\n\nTo benchmark RNNs:\n```\nbin/rnn_bench\n```\n\nTo benchmark GEMM:\n```\nbin/gemm_bench\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaidu-research%2FDeepBench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaidu-research%2FDeepBench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaidu-research%2FDeepBench/lists"}