{"id":19352019,"url":"https://github.com/lukashedegaard/pytorch-benchmark","last_synced_at":"2025-10-08T19:58:58.785Z","repository":{"id":43706313,"uuid":"457366382","full_name":"LukasHedegaard/pytorch-benchmark","owner":"LukasHedegaard","description":"Easily benchmark PyTorch model FLOPs, latency, throughput, allocated gpu memory and energy consumption ","archived":false,"fork":false,"pushed_at":"2023-08-25T09:38:10.000Z","size":88,"stargazers_count":99,"open_issues_count":3,"forks_count":11,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-11T21:58:51.257Z","etag":null,"topics":["benchmark","deep-learning","flops","gpu","jetson","python","pytorch","timing-analysis"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LukasHedegaard.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-02-09T13:17:50.000Z","updated_at":"2025-04-11T14:52:50.000Z","dependencies_parsed_at":"2024-01-18T04:51:50.455Z","dependency_job_id":"fa676282-3c07-4a9a-b6e0-4364299704fc","html_url":"https://github.com/LukasHedegaard/pytorch-benchmark","commit_stats":{"total_commits":51,"total_committers":2,"mean_commits":25.5,"dds":"0.17647058823529416","last_synced_commit":"e2e148109fedf82e64e50232b3421c95021f9ef5"},"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LukasHedegaard%2Fpytorch-benchmark","tags_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/LukasHedegaard%2Fpytorch-benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LukasHedegaard%2Fpytorch-benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LukasHedegaard%2Fpytorch-benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LukasHedegaard","download_url":"https://codeload.github.com/LukasHedegaard/pytorch-benchmark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250391153,"owners_count":21422849,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","deep-learning","flops","gpu","jetson","python","pytorch","timing-analysis"],"created_at":"2024-11-10T04:37:53.552Z","updated_at":"2025-10-08T19:58:53.733Z","avatar_url":"https://github.com/LukasHedegaard.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ⏱ pytorch-benchmark\n__Easily benchmark model inference FLOPs, latency, throughput, max allocated memory and energy consumption__\n\u003cdiv align=\"left\"\u003e\n  \u003ca href=\"https://pypi.org/project/pytorch-benchmark/\"\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/pyversions/pytorch-benchmark\" height=\"20\" \u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://badge.fury.io/py/pytorch-benchmark\"\u003e\n    \u003cimg src=\"https://badge.fury.io/py/pytorch-benchmark.svg\" height=\"20\" \u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://pepy.tech/project/pytorch-benchmark\"\u003e\n    \u003cimg 
src=\"https://static.pepy.tech/badge/pytorch-benchmark\" height=\"20\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://www.codefactor.io/repository/github/lukashedegaard/pytorch-benchmark/overview/main\"\u003e\n    \u003cimg src=\"https://www.codefactor.io/repository/github/lukashedegaard/pytorch-benchmark/badge/main\" alt=\"CodeFactor\" /\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://opensource.org/licenses/Apache-2.0\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/License-Apache%202.0-blue.svg\" height=\"20\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://github.com/psf/black\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/code%20style-black-000000.svg\" height=\"20\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://codecov.io/gh/LukasHedegaard/pytorch-benchmark\"\u003e\n    \u003cimg src=\"https://codecov.io/gh/LukasHedegaard/pytorch-benchmark/branch/main/graph/badge.svg?token=B91XGSKSFJ\"/\u003e\n  \u003c/a\u003e\n   \u003csup\u003e*\u003c/sup\u003e\n\u003c/div\u003e\n\n###### \\*Actual coverage is higher as GPU-related code is skipped by Codecov\n\n## Install \n```bash\npip install pytorch-benchmark\n```\n\n## Usage \n```python\nimport torch\nfrom torchvision.models import efficientnet_b0\nfrom pytorch_benchmark import benchmark\n\n\nmodel = efficientnet_b0().to(\"cpu\")  # Model device sets benchmarking device\nsample = torch.randn(8, 3, 224, 224)  # (B, C, H, W)\nresults = benchmark(model, sample, num_runs=100)\n```\n\n### Sample results 💻\n\u003cdetails\u003e\n  \u003csummary\u003eMacbook Pro (16-inch, 2019), 2.6 GHz 6-Core Intel Core i7\u003c/summary\u003e\n  \n  ```\n  device: cpu\n  flops: 401669732\n  machine_info:\n    cpu:\n      architecture: x86_64\n      cores:\n        physical: 6\n        total: 12\n      frequency: 2.60 GHz\n      model: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz\n    gpus: null\n    memory:\n      available: 5.86 GB\n      total: 16.00 GB\n      used: 7.29 GB\n    system:\n      node: d40049\n      
release: 21.2.0\n      system: Darwin\n  params: 5288548\n  timing:\n    batch_size_1:\n      on_device_inference:\n        human_readable:\n          batch_latency: 74.439 ms +/- 6.459 ms [64.604 ms, 96.681 ms]\n          batches_per_second: 13.53 +/- 1.09 [10.34, 15.48]\n        metrics:\n          batches_per_second_max: 15.478907181264278\n          batches_per_second_mean: 13.528026359855625\n          batches_per_second_min: 10.343281300091244\n          batches_per_second_std: 1.0922382209314958\n          seconds_per_batch_max: 0.09668111801147461\n          seconds_per_batch_mean: 0.07443853378295899\n          seconds_per_batch_min: 0.06460404396057129\n          seconds_per_batch_std: 0.006458734193132054\n    batch_size_8:\n      on_device_inference:\n        human_readable:\n          batch_latency: 509.410 ms +/- 30.031 ms [405.296 ms, 621.773 ms]\n          batches_per_second: 1.97 +/- 0.11 [1.61, 2.47]\n        metrics:\n          batches_per_second_max: 2.4673319862230025\n          batches_per_second_mean: 1.9696935126370148\n          batches_per_second_min: 1.6083039834656554\n          batches_per_second_std: 0.11341204895590185\n          seconds_per_batch_max: 0.6217730045318604\n          seconds_per_batch_mean: 0.509410228729248\n          seconds_per_batch_min: 0.40529608726501465\n          seconds_per_batch_std: 0.030031445467788704\n  ```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eServer with NVIDIA GeForce RTX 2080 and Intel Xeon 2.10GHz CPU\u003c/summary\u003e\n  \n  ```\n  device: cuda\n  flops: 401669732\n  machine_info:\n    cpu:\n      architecture: x86_64\n      cores:\n        physical: 16\n        total: 32\n      frequency: 3.00 GHz\n      model: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz\n    gpus:\n    - memory: 8192.0 MB\n      name: NVIDIA GeForce RTX 2080\n    - memory: 8192.0 MB\n      name: NVIDIA GeForce RTX 2080\n    - memory: 8192.0 MB\n      name: NVIDIA GeForce RTX 2080\n    - memory: 8192.0 
MB\n      name: NVIDIA GeForce RTX 2080\n    memory:\n      available: 119.98 GB\n      total: 125.78 GB\n      used: 4.78 GB\n    system:\n      node: monster\n      release: 4.15.0-167-generic\n      system: Linux\n  max_inference_memory: 736250368\n  params: 5288548\n  post_inference_memory: 21402112\n  pre_inference_memory: 21402112\n  timing:\n    batch_size_1:\n      cpu_to_gpu:\n        human_readable:\n          batch_latency: \"144.815 \\xB5s +/- 16.103 \\xB5s [136.614 \\xB5s, 272.751 \\xB5\\\n            s]\"\n          batches_per_second: 6.96 K +/- 535.06 [3.67 K, 7.32 K]\n        metrics:\n          batches_per_second_max: 7319.902268760908\n          batches_per_second_mean: 6962.865857677197\n          batches_per_second_min: 3666.3496503496503\n          batches_per_second_std: 535.0581873859935\n          seconds_per_batch_max: 0.0002727508544921875\n          seconds_per_batch_mean: 0.00014481544494628906\n          seconds_per_batch_min: 0.0001366138458251953\n          seconds_per_batch_std: 1.6102982159292097e-05\n      gpu_to_cpu:\n        human_readable:\n          batch_latency: \"106.168 \\xB5s +/- 17.829 \\xB5s [53.167 \\xB5s, 248.909 \\xB5\\\n            s]\"\n          batches_per_second: 9.64 K +/- 1.60 K [4.02 K, 18.81 K]\n        metrics:\n          batches_per_second_max: 18808.538116591928\n          batches_per_second_mean: 9639.942102368092\n          batches_per_second_min: 4017.532567049808\n          batches_per_second_std: 1595.7983033708472\n          seconds_per_batch_max: 0.00024890899658203125\n          seconds_per_batch_mean: 0.00010616779327392578\n          seconds_per_batch_min: 5.316734313964844e-05\n          seconds_per_batch_std: 1.7829135190772566e-05\n      on_device_inference:\n        human_readable:\n          batch_latency: \"15.567 ms +/- 546.154 \\xB5s [15.311 ms, 19.261 ms]\"\n          batches_per_second: 64.31 +/- 1.96 [51.92, 65.31]\n        metrics:\n          batches_per_second_max: 
65.31149174711928\n          batches_per_second_mean: 64.30692850265713\n          batches_per_second_min: 51.918698784442846\n          batches_per_second_std: 1.9599322351815833\n          seconds_per_batch_max: 0.019260883331298828\n          seconds_per_batch_mean: 0.015567030906677246\n          seconds_per_batch_min: 0.015311241149902344\n          seconds_per_batch_std: 0.0005461537255227954\n      total:\n        human_readable:\n          batch_latency: \"15.818 ms +/- 549.873 \\xB5s [15.561 ms, 19.461 ms]\"\n          batches_per_second: 63.29 +/- 1.92 [51.38, 64.26]\n        metrics:\n          batches_per_second_max: 64.26476266356143\n          batches_per_second_mean: 63.28565696640637\n          batches_per_second_min: 51.38378232692614\n          batches_per_second_std: 1.9198343850767468\n          seconds_per_batch_max: 0.019461393356323242\n          seconds_per_batch_mean: 0.01581801414489746\n          seconds_per_batch_min: 0.015560626983642578\n          seconds_per_batch_std: 0.0005498731526138171\n    batch_size_8:\n      cpu_to_gpu:\n        human_readable:\n          batch_latency: \"805.674 \\xB5s +/- 157.254 \\xB5s [773.191 \\xB5s, 2.303 ms]\"\n          batches_per_second: 1.26 K +/- 97.51 [434.24, 1.29 K]\n        metrics:\n          batches_per_second_max: 1293.3407338883749\n          batches_per_second_mean: 1259.5653105357776\n          batches_per_second_min: 434.23791282741485\n          batches_per_second_std: 97.51424036939879\n          seconds_per_batch_max: 0.002302885055541992\n          seconds_per_batch_mean: 0.000805673599243164\n          seconds_per_batch_min: 0.0007731914520263672\n          seconds_per_batch_std: 0.0001572538140613121\n      gpu_to_cpu:\n        human_readable:\n          batch_latency: \"104.215 \\xB5s +/- 12.658 \\xB5s [59.605 \\xB5s, 128.031 \\xB5\\\n            s]\"\n          batches_per_second: 9.81 K +/- 1.76 K [7.81 K, 16.78 K]\n        metrics:\n          batches_per_second_max: 16777.216\n 
         batches_per_second_mean: 9806.840626578907\n          batches_per_second_min: 7810.621973929236\n          batches_per_second_std: 1761.6008872740726\n          seconds_per_batch_max: 0.00012803077697753906\n          seconds_per_batch_mean: 0.00010421514511108399\n          seconds_per_batch_min: 5.9604644775390625e-05\n          seconds_per_batch_std: 1.2658293070174213e-05\n      on_device_inference:\n        human_readable:\n          batch_latency: \"16.623 ms +/- 759.017 \\xB5s [16.301 ms, 22.584 ms]\"\n          batches_per_second: 60.26 +/- 2.22 [44.28, 61.35]\n        metrics:\n          batches_per_second_max: 61.346243290283894\n          batches_per_second_mean: 60.25881046175457\n          batches_per_second_min: 44.27827629162004\n          batches_per_second_std: 2.2193085956672296\n          seconds_per_batch_max: 0.02258443832397461\n          seconds_per_batch_mean: 0.01662288188934326\n          seconds_per_batch_min: 0.01630091667175293\n          seconds_per_batch_std: 0.0007590167680596548\n      total:\n        human_readable:\n          batch_latency: \"17.533 ms +/- 836.015 \\xB5s [17.193 ms, 23.896 ms]\"\n          batches_per_second: 57.14 +/- 2.20 [41.85, 58.16]\n        metrics:\n          batches_per_second_max: 58.16374528511205\n          batches_per_second_mean: 57.140338855126565\n          batches_per_second_min: 41.84762740950632\n          batches_per_second_std: 2.1985066663972677\n          seconds_per_batch_max: 0.023896217346191406\n          seconds_per_batch_mean: 0.01753277063369751\n          seconds_per_batch_min: 0.017192840576171875\n          seconds_per_batch_std: 0.0008360147274630088\n  ```\n\u003c/details\u003e\n\n... 
Your turn\n\n## How we benchmark\nThe overall flow can be summarized with the diagram shown below (best viewed on GitHub):\n```mermaid\nflowchart TB;\n    A([Start]) --\u003e B\n    B(prepare_samples)\n    B --\u003e C[get_machine_info]\n    C --\u003e D[measure_params]\n    D --\u003e E[warm_up, batch_size=1]\n    E --\u003e F[measure_flops]\n    \n    subgraph SG[Repeat for batch_size 1 and x]\n        direction TB\n        G[measure_allocated_memory]\n        G --\u003e H[warm_up, given batch_size]\n        H --\u003e I[measure_detailed_inference_timing]\n        I --\u003e J[measure_repeated_inference_timing]\n        J --\u003e K[measure_energy]\n    end\n\n    F --\u003e SG\n    SG --\u003e END([End])\n```\n\nUsually, the sample and model don't reside on the same device initially (e.g., a GPU holds the model while the sample is on CPU after being loaded from disk or collected as live data). Accordingly, we measure timing in three parts: `cpu_to_gpu`, `on_device_inference`, and `gpu_to_cpu`, as well as a sum of the three, `total`. Note that the device on which the model resides determines the execution device. 
The inference flow is shown below:\n\n```mermaid\nflowchart LR;\n    A([sample])\n    A --\u003e B[cpu -\u003e gpu]\n    B --\u003e C[model __call__]\n    C --\u003e D[gpu -\u003e cpu]\n    D --\u003e E([result])\n```\n\n## Advanced use\nTrying to benchmark a custom class that is not a `torch.nn.Module`?\nYou can pass custom functions to `benchmark` as seen in [this example](tests/test_custom_class.py).\n\n\n## Limitations\n- Allocated memory measurements are only available on CUDA devices.\n- Energy consumption can only be measured on NVIDIA Jetson platforms at the moment.\n- FLOPs and parameter count are not supported for custom classes.\n\n\n## Acknowledgement\nThis work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871449 (OpenDR).\nIt was developed for benchmarking tools in [OpenDR](https://github.com/opendr-eu/opendr), a non-proprietary toolkit for deep learning based functionalities for robotics and vision.\n\n\n## Citation\nIf you like the tool and use it in research, please consider citing it:\n```bibtex\n@software{hedegaard2022pytorchbenchmark,\n  author = {Hedegaard, Lukas},\n  doi = {10.5281/zenodo.7223585},\n  month = {10},\n  title = {{PyTorch-Benchmark}},\n  version = {0.3.5},\n  year = {2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flukashedegaard%2Fpytorch-benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flukashedegaard%2Fpytorch-benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flukashedegaard%2Fpytorch-benchmark/lists"}