{"id":13442812,"url":"https://github.com/sandeep-krishnamurthy/dl-operator-benchmark","last_synced_at":"2025-03-20T15:31:02.812Z","repository":{"id":152321520,"uuid":"181583472","full_name":"sandeep-krishnamurthy/dl-operator-benchmark","owner":"sandeep-krishnamurthy","description":"Framework for benchmarking deep learning operators for Apache MXNet","archived":false,"fork":false,"pushed_at":"2019-05-09T22:27:39.000Z","size":162,"stargazers_count":6,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-08-01T03:42:00.939Z","etag":null,"topics":["apache-mxnet","benchmarking-framework","deep-learning","performance"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sandeep-krishnamurthy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-04-16T00:10:52.000Z","updated_at":"2022-11-22T04:00:52.000Z","dependencies_parsed_at":"2023-05-30T01:45:14.578Z","dependency_job_id":null,"html_url":"https://github.com/sandeep-krishnamurthy/dl-operator-benchmark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandeep-krishnamurthy%2Fdl-operator-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandeep-krishnamurthy%2Fdl-operator-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sandeep-krishnamurthy%2Fdl-operator-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/sandeep-krishnamurthy%2Fdl-operator-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sandeep-krishnamurthy","download_url":"https://codeload.github.com/sandeep-krishnamurthy/dl-operator-benchmark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221772566,"owners_count":16878130,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-mxnet","benchmarking-framework","deep-learning","performance"],"created_at":"2024-07-31T03:01:51.455Z","updated_at":"2024-10-28T03:31:06.693Z","avatar_url":"https://github.com/sandeep-krishnamurthy.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# DL Operator Benchmarks\nA Python framework for benchmarking operators in [Apache MXNet Deep Learning Library](http://mxnet.incubator.apache.org/).\n\n## Features\n\n1. Individual operator benchmarks to capture \"Speed\" - operator execution time for `Forward`, `Backward` or `Both Forward and Backward` operations.\n2. Individual operator benchmarks to capture \"Memory\" usage. (TODO - Coming Soon...)\n3. Benchmarks for commonly fused operators. Ex: Conv + Relu, Conv + BatchNorm. (TODO - Coming Soon...)\n4. Benchmarks for operators with varying inputs to uncover any performance issues due to skewed input data. Example: Measuring operator performance on small input tensors, large input tensors along with average normally used tensor sizes.\n5. Support running all or subset of operator benchmarks.\n6. 
Support running operator benchmarks with reasonable default inputs or with customized inputs. Example: By default, use a (1024, 1024) Float32 tensor for the Add operator, or allow users to specify a dtype of Float64 and tensors of shape (10, 100).\n7. Support exporting benchmark results in different formats - stdout (console output), dictionary output, or JSON, Markdown or CSV files.\n\n\nCurrently supports benchmarking [`NDArray`](http://mxnet.incubator.apache.org/api/python/ndarray/ndarray.html) operators and [`Gluon`](http://mxnet.incubator.apache.org/api/python/gluon/gluon.html) blocks (layers) in MXNet.\n\n## Motivation\n\nBenchmarks are usually done end-to-end for a given Network Architecture. For example: ResNet-50 benchmarks on ImageNet data. This is a good measure of the overall performance and health of a deep learning framework. However, it is important to note the following factors:\n1. Users use many operators that are not part of a standard network like ResNet. Example: Tensor manipulation operators like mean, max, topk, argmax, sort etc.   \n2. A standard Network Architecture like ResNet-50 is made up of many operators. Ex: Convolution2D, Softmax, Dense and more. Consider the following scenarios:\n    1. We improved the performance of the Convolution2D operator, but due to a bug, Softmax performance went down. Overall, end-to-end benchmarks may appear to run fine, so we may miss the performance degradation of a single operator, which can accumulate and become untraceable.\n    2. You need to see which operator in a given network takes the most time in order to plan optimization work. With end-to-end benchmarks, it is hard to get fine-grained numbers at the operator level.\n3. We need to know how different operators perform on different hardware infrastructure (Ex: CPU with MKLDNN, GPU with NVIDIA CUDA and cuDNN). 
With these details, we can plan optimization work at the operator level, which could significantly improve end-to-end performance.\n4. You want to have nightly performance tests across all operators in a deep learning framework to catch regressions early.\n5. We can integrate this framework with a CI/CD system to run per-operator performance tests for PRs. Example: When a PR modifies the kernel of TransposeConv2D, we can run benchmarks of the TransposeConv2D operator to verify performance.\n\nHence, in this framework, we will build the functionality to allow users and developers of deep learning frameworks to easily run benchmarks for individual operators.\n\n## How to use\n\n### Pre-Requisites\n\n1. MXNet\n2. Python3\n\n\n```bash\n# Install the version of MXNet to be tested\npip install mxnet       # For CPU (By default comes with MKLDNN)\npip install mxnet-cu10  # For GPU with CUDA 9.2\n\n# Clone the operator benchmark library\ngit clone https://github.com/sandeep-krishnamurthy/dl-operator-benchmark\n```\n\n\n### Run benchmarks for all the operators\n\nThe command below runs all MXNet operator benchmarks (NDArray and Gluon) with default inputs and saves the final result as JSON in the given file.\n\n```\npython dl-operator-benchmark/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json\n\n```\n\n**Other Options:**\n\n1. **output-format** : `json`, `md` (Markdown file output) or `csv`.\n\n2. **ctx** : By default, `cpu` on a CPU machine, `gpu(0)` on a GPU machine. You can override and set the global context for all operator benchmarks. Example: `--ctx gpu(2)`.\n\n3. **dtype** : By default, `float32`. You can override and set the global dtype for all operator benchmarks. Example: `--dtype float64`.\n\n### Run benchmarks for all the operators in a specific category\n\nFor example, to run benchmarks for all `NDArray Arithmetic Operators`, run the following Python script.\n\n```python\n#! 
/usr/bin/python\nfrom mxnet_benchmarks.nd import run_all_arithmetic_operations_benchmarks\n\n# Run all Arithmetic operations benchmarks with default input values\nrun_all_arithmetic_operations_benchmarks()\n\n```\n\nOutput for the above benchmark run, on a CPU machine, would look something like below:\n\n```\nMX_Add_Forward_Backward_Time - 0.015201 seconds\nMX_Multiply_Forward_Backward_Time - 0.021678 seconds\nMX_Subtract_Forward_Backward_Time - 0.016154 seconds\nMX_Divide_Forward_Backward_Time - 0.024327 seconds\nMX_Modulo_Forward_Backward_Time - 0.045726 seconds\nMX_Power_Forward_Backward_Time - 0.077152 seconds\nMX_Negative_Forward_Backward_Time - 0.014472 seconds\nMX_Inplace_Add_Forward_Time - 0.003824 seconds\nMX_Inplace_Subtract_Forward_Time - 0.004137 seconds\nMX_Inplace_Multiply_Forward_Time - 0.006589 seconds\nMX_Inplace_Division_Forward_Time - 0.003869 seconds\nMX_Inplace_Modulo_Forward_Time - 0.018180 seconds\n```\n\n### Run benchmarks for a specific operator\n\nFor example, to run benchmarks for the `nd.add` operator in MXNet, run the following Python script.\n\n#### CASE 1 - Default Inputs for Operators\n\n```python\n#! /usr/bin/python\nfrom mxnet_benchmarks.nd import Add\n\n# Run the Add operator benchmark with default input values\nadd_benchmark = Add()\nadd_benchmark.run_benchmark()\nadd_benchmark.print_benchmark_results()\n\n```\n\nOutput for the above benchmark run, on a CPU machine, would look something like below:\n\n```\nMX_Add_Forward_Backward_Time - 0.015201 seconds\n```\n\n#### CASE 2 - Customize Inputs for Operators\n\nIn this case, assume you want to run the benchmark on a `float64` tensor instead of the default `float32`.\n\n```python\n#! 
/usr/bin/python\nfrom mxnet_benchmarks.nd import Add\n\n# Run the Add operator benchmark with float64 inputs instead of the default float32\nadd_benchmark = Add(inputs={\"dtype\": \"float64\"})\nadd_benchmark.run_benchmark()\nadd_benchmark.print_benchmark_results()\n\n```\n\nOutput for the above benchmark run, on a CPU machine, would look something like below:\n\n```\nMX_Add_Forward_Backward_Time - 0.025405 seconds\n```\n\n**NOTE:** You can print the input parameters used for a benchmark as shown below.\n\n```python\nfrom mxnet_benchmarks.nd import Add\n\n# Create the Add operator benchmark with float64 inputs and print its input parameters\nadd_benchmark = Add(inputs={\"dtype\": \"float64\"})\nprint(add_benchmark.inputs)\n```\n\nOutput:\n```\n{'lhs': (1024, 1024), 'rhs': (1024, 1024), 'initializer': \u003cfunction normal at 0x117b607b8\u003e, 'run_backward': True, 'dtype': 'float64'}\n```\n\n## Future Development\n\n1. Logging\n2. Currently around 134 MXNet operators (out of around 250) are supported for benchmarks. Help add support for more operators.\n3. Add support for Memory profiling and benchmarking.\n4. Support more complex operator structure for benchmarking. Example: Fused operator - Conv + BatchNorm, Conv + Relu etc.\n5. Integration with the MXNet profiler to get more fine-grained profiling results, such as timings without Python layer overhead, pure forward-only timing and backward-only timing.\n6. In future, we plan to support PyTorch and other deep learning libraries to help users compare individual operator performance across frameworks.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsandeep-krishnamurthy%2Fdl-operator-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsandeep-krishnamurthy%2Fdl-operator-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsandeep-krishnamurthy%2Fdl-operator-benchmark/lists"}