{"id":14693318,"url":"https://github.com/NAThompson/performance_tuning_tutorial","last_synced_at":"2025-09-09T21:32:08.911Z","repository":{"id":94214662,"uuid":"304658921","full_name":"NAThompson/performance_tuning_tutorial","owner":"NAThompson","description":"Performance Tuning Tutorial given at Oak Ridge National Laboratory","archived":false,"fork":false,"pushed_at":"2021-05-19T16:20:59.000Z","size":1762,"stargazers_count":178,"open_issues_count":0,"forks_count":19,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-05-06T02:40:22.238Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NAThompson.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-16T14:58:28.000Z","updated_at":"2025-04-28T09:46:38.000Z","dependencies_parsed_at":"2023-04-04T00:33:14.116Z","dependency_job_id":null,"html_url":"https://github.com/NAThompson/performance_tuning_tutorial","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NAThompson/performance_tuning_tutorial","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NAThompson%2Fperformance_tuning_tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NAThompson%2Fperformance_tuning_tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NAThompson%2Fperformance_tuning_tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
NAThompson%2Fperformance_tuning_tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NAThompson","download_url":"https://codeload.github.com/NAThompson/performance_tuning_tutorial/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NAThompson%2Fperformance_tuning_tutorial/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274366224,"owners_count":25272293,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-09T02:00:10.223Z","response_time":80,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-13T02:01:13.153Z","updated_at":"2025-09-09T21:32:08.134Z","avatar_url":"https://github.com/NAThompson.png","language":"C++","funding_links":[],"categories":["C++","Profiling"],"sub_categories":[],"readme":"slidenumbers: true\n\n# Performance Tuning\n\n![inline](figures/logo.svg)\n\nNick Thompson\n\n^ Thanks for coming. 
First let's give a shoutout to Matt Wolf for putting this tutorial together, and to Barney Maccabe for putting the support behind it to make it happen.\n\n---\n\nSession 1: Using `perf`\n\n[Follow along](https://github.com/NAThompson/performance_tuning_tutorial):\n\n```\n$ git clone https://github.com/NAThompson/performance_tuning_tutorial\n$ cd performance_tuning_tutorial\n$ make\n$ ./dot 100000000\n```\n\n^ This is a tutorial, so definitely follow along. I will be pacing this under the assumption you will be following along, so you'll get bored if you're watching. In addition, at the end of the tutorial we'll do a short quiz, not to stress anyone out, but to solidify the concepts. I hope that'll galvanize us to bring a bit more intensity than usually brought to a six hour training session! If the stakes are too low, we're just gonna waste two good mornings.\n\n^ Please get the notes from github, and attempt to issue the commands.\n\n---\n\n## What is `perf`?\n\n - Performance tools for linux\n - Designed to profile kernel, but can profile userspace apps\n - Sampling based\n - Canonized in linux kernel source code\n\n---\n\n## Installing `perf`: Ubuntu\n\n```bash\n$ sudo apt install linux-tools-common\n$ sudo apt install linux-tools-generic\n$ sudo apt install linux-tools-`uname -r`\n```\n\n^ Installation is pretty easy on Ubuntu.\n\n---\n\n## Installing `perf`: CentOS\n\n```bash\n$ yum install perf\n```\n\n---\n\n## Access `perf`:\n\n`perf` is available on Summit (summit.olcf.ornl.gov), Andes (andes.olcf.ornl.gov) and the SNS nodes (analysis.sns.gov)\n\nI have verified that all the commands of this tutorial work on Andes.\n\n---\n\n## Installing `perf`: Source build\n\n```bash\n$ git clone --depth=1 https://github.com/torvalds/linux.git\n$ cd linux/tools/perf;\n$ make\n$ ./perf\n```\n\n^ I like doing source builds of `perf`. Not only because I often don't have root, but also because `perf` improves over time, so I like to get the latest version. 
For example, new hardware counters were recently added for the Power9 architecture.\n\n---\n\n## Please do a source build for this tutorial!\n\nA source build is the first step to owning your tools, and will help us all be on the same page.\n\n---\n\n## Ubuntu Dependencies\n\n```\n$ sudo apt install -y bison flex libslang2-dev systemtap-sdt-dev \\\n   libnuma-dev libcap-dev libbabeltrace-ctf-dev libiberty-dev python-dev\n```\n\n---\n\n# `perf_permissions.sh`\n\n```bash\n#!/bin/bash\n\n# Taken from Milian Wolf's talk \"Linux perf for Qt developers\"\nsudo mount -o remount,mode=755 /sys/kernel/debug\nsudo mount -o remount,mode=755 /sys/kernel/debug/tracing\necho \"0\" | sudo tee /proc/sys/kernel/kptr_restrict\necho \"-1\" | sudo tee /proc/sys/kernel/perf_event_paranoid\nsudo chown `whoami` /sys/kernel/debug/tracing/uprobe_events\nsudo chmod a+rw /sys/kernel/debug/tracing/uprobe_events\n```\n\n^ If we have root, we have the ability to extract more information from `perf` traces. Kernel debug symbols are a nice to have, but not a need to have, so if you don't have root, don't fret too much.\n\n---\n\n## `perf` MWE\n\n```bash\n$ perf stat ls\ndata  Desktop  Documents  Downloads  Music  Pictures  Public  Templates  TIS  Videos\n\n Performance counter stats for 'ls':\n\n              2.78 msec task-clock:u              #    0.094 CPUs utilized          \n                 0      context-switches:u        #    0.000 K/sec                  \n                 0      cpu-migrations:u          #    0.000 K/sec                  \n               283      page-faults:u             #    0.102 M/sec                  \n           838,657      cycles:u                  #    0.302 GHz                    \n           584,659      instructions:u            #    0.70  insn per cycle         \n           128,106      branches:u                #   46.109 M/sec                  \n             7,907      branch-misses:u           #    6.17% of all branches        \n\n       0.029630910 
seconds time elapsed\n\n       0.000000000 seconds user\n       0.003539000 seconds sys\n```\n\n^ This is the `perf` \"hello world\". You might see something a bit different depending on your architecture and `perf` version.\n\n---\n\n## Why `perf`?\n\nThere are lots of great performance analysis tools (Intel VTune, Score-P, tau, cachegrind), but my opinion is that `perf` should be the first tool you reach for.\n\n---\n\n## Why `perf`?\n\n- No fighting for a license, or install Java runtimes on HPC clusters\n- No need to vandalize source code, or be constrained to work with a set of specific languages\n- Text GUI, so easy to use in terminal and over `ssh`\n\n---\n\n## Why `perf`?\n\n- Available on any Linux system\n- Not limited to x86: works on ARM, RISC-V, PowerPC, Sparc\n- Samples rather than models your program\n- Doesn't slow your program down\n\n^ I was trained in mathematics, and I love learning math because it feels permanent. The situation in computer science is much worse. For example, if no one decides to write a Fortran compiler that targets the new Apple M1 chip, there's no Fortran on the Apple M1! So learning tools which will last is important to me.\n\n^ `perf` is part of the linux kernel, so it has credibility that it will survive for a long time. It also works on any architecture Linux compiles on, so it's widely available. 
As a sampling profiler, it relies on statistics, not a model of your program.\n\n---\n\n## Why not `perf`?\n\n- Text GUI, so fancy graphics must be generated by post-processing\n- *Only* available on Linux\n- Significant limitations when profiling GPUs\n\n---\n\n### `src/mwe.cpp`\n\n```cpp\n#include \u003ciostream\u003e\n#include \u003cvector\u003e\n\ndouble dot_product(double* a, double* b, size_t n) {\n    double d = 0;\n    for (size_t i = 0; i \u003c n; ++i) {\n        d += a[i]*b[i];\n    }\n    return d;\n}\n\nint main(int argc, char** argv) {\n    if (argc != 2) {\n        std::cerr \u003c\u003c \"Usage: ./dot 10\\n\";\n        return 1;\n    }\n    size_t n = atoi(argv[1]);\n    std::vector\u003cdouble\u003e a(n);\n    std::vector\u003cdouble\u003e b(n);\n    for (size_t i = 0; i \u003c n; ++i) {\n        a[i] = i;\n        b[i] = 1/double(i+3);\n    }\n    double d = dot_product(a.data(), b.data(), n);\n    std::cout \u003c\u003c \"a·b = \" \u003c\u003c d \u003c\u003c \"\\n\";\n}\n```\n\n---\n\n## Running the MWE under `perf`\n\n```bash\n$ g++ src/mwe.cpp\n$ perf stat ./a.out 1000000000\na.b = 1e+09\n\n Performance counter stats for './a.out 1000000000':\n\n         14,881.09 msec task-clock:u              #    0.999 CPUs utilized\n                 0      context-switches:u        #    0.000 K/sec\n                 0      cpu-migrations:u          #    0.000 K/sec\n            17,595      page-faults:u             #    0.001 M/sec\n    39,657,728,345      cycles:u                  #    2.665 GHz                      (50.00%)\n    27,974,789,022      stalled-cycles-frontend:u #   70.54% frontend cycles idle     (50.01%)\n     6,000,965,962      stalled-cycles-backend:u  #   15.13% backend cycles idle      (50.01%)\n    88,999,950,765      instructions:u            #    2.24  insn per cycle\n                                                  #    0.31  stalled cycles per insn  (50.00%)\n    15,998,544,101      branches:u                # 1075.093 M/sec        
            (49.99%)\n            37,578      branch-misses:u           #    0.00% of all branches          (49.99%)\n\n      14.892496917 seconds time elapsed\n\n      13.566616000 seconds user\n       1.199643000 seconds sys\n```\n\n^ If you have a different `perf` version, you might see `stalled-cycles:frontend` and `stalled-cycles:backend`.\nStalled frontend cycles are those where instructions could not be decoded fast enough to operate on the data.\nStalled backend cycles are those where data did not arrive fast enough. Backend cycles stall much more frequently than frontend cycles. See [here](https://stackoverflow.com/questions/22165299) for more details.\n\n---\n\n## Learning from `perf stat`\n\n- 2.24 instructions/cycle and large number of stalled frontend cycles means we're probably CPU bound. Right? Right?? (Stay tuned)\n- Our branch miss rate is really good!\n\nBut it's not super informative, nor is it actionable.\n\n---\n\n## Aside on 'frontend-cycles' vs 'backend-cycles'\n\n![inline](figures/frontend_v_backend.png)\n\n[Source](https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html)\n\nThis is how Intel divvies up the \"frontend\" and \"backend\" of the CPU. Frontend is responsible for instruction scheduling and decoding, the backend is for executing instructions and fetching data.\n\n---\n\n\u003e The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or to finish long latency instructions (e.g. transcedentals - sqrt, reciprocals, divisions, etc.). The cycles stalled in the front-end are a waste because that means that the Front-End does not feed the Back End with micro-operations. This can mean that you have misses in the Instruction cache, or complex instructions that are not already decoded in the micro-op cache. 
Just-in-time compiled code usually expresses this behavior.\n\n-- [stackoverflow](https://stackoverflow.com/a/29059380/)\n\n---\n\n## Learning from `perf stat`\n\n`perf` is written by kernel developers, so the `perf stat` defaults are for them.\n\nAt ORNL, we're HPC developers, so let's make some changes. What stats do we have available?\n\n---\n\n```\n$ perf list\nList of pre-defined events (to be used in -e):\n\n  branch-misses                                      [Hardware event]\n  cache-misses                                       [Hardware event]\n  cache-references                                   [Hardware event]\n  instructions                                       [Hardware event]\n  task-clock                                         [Software event]\n\n  L1-dcache-load-misses                              [Hardware cache event]\n  L1-dcache-loads                                    [Hardware cache event]\n  LLC-load-misses                                    [Hardware cache event]\n  LLC-loads                                          [Hardware cache event]\n\n  cache-misses OR cpu/cache-misses/                  [Kernel PMU event]\n  cache-references OR cpu/cache-references/          [Kernel PMU event]\n  power/energy-cores/                                [Kernel PMU event]\n  power/energy-pkg/                                  [Kernel PMU event]\n  power/energy-ram/                                  [Kernel PMU event]\n```\n\n^ Every architecture has a different set of PMCs, so this list will be different for everyone. 
I like the `power` measurements, since speed is not the only sensible objective we might want to pursue.\n\n---\n\n## Custom events\n\n```\nperf stat -e instructions,cycles,L1-dcache-load-misses,L1-dcache-loads,LLC-load-misses,LLC-loads ./dot 100000000\na.b = 9.99999e+07\n\n Performance counter stats for './dot 100000000':\n\n     8,564,368,466      instructions:u            #    1.41  insn per cycle           (49.98%)\n     6,060,955,584      cycles:u                                                      (66.65%)\n        34,089,080      L1-dcache-load-misses:u   #    0.90% of all L1-dcache hits    (83.34%)\n     3,805,929,303      L1-dcache-loads:u                                             (83.32%)\n           854,522      LLC-load-misses:u         #   39.87% of all LL-cache hits     (33.31%)\n         2,143,437      LLC-loads:u                                                   (33.31%)\n\n       5.045450844 seconds time elapsed\n\n       2.856660000 seconds user\n       2.185739000 seconds sys\n```\n\n^ Hmm . . . 40% LL cache miss rate, yet 1.4 instructions/cycle. This CPU-bound vs memory-bound is a bit complicated . . .\n\n^ Personally I don't regard CPU-bound vs memory-bound to be an \"actionable\" way of thinking. 
We can turn a slow CPU bound program into a fast memory-bound program by just not doing dumb stuff.\n\n---\n\n## Custom events: gotchas\n\nThese events are not stable across CPU architectures, nor even `perf` versions!\n\nThe events expose the functionality of hardware counters; different hardware has different counters.\n\nAnd someone needs to do the work of exposing them in `perf`!\n\n---\n\n```\n$ perf list\n  cycle_activity.stalls_l1d_pending                 \n       [Execution stalls due to L1 data cache misses]\n  cycle_activity.stalls_l2_pending                  \n       [Execution stalls due to L2 cache misses]\n  cycle_activity.stalls_ldm_pending                 \n       [Execution stalls due to memory subsystem]\n$ perf stat -e cycle_activity.stalls_ldm_pending,cycle_activity.stalls_l2_pending,cycle_activity.stalls_l1d_pending,cycles ./dot 10000000\na.b = 9.99999e+07\n\n Performance counter stats for './dot 100000000':\n\n       509,998,525      cycle_activity.stalls_ldm_pending:u                                   \n       127,137,070      cycle_activity.stalls_l2_pending:u                                   \n        70,555,574      cycle_activity.stalls_l1d_pending:u                                   \n     5,708,220,052      cycles:u                                                    \n\n       3.637099623 seconds time elapsed\n\n       2.463966000 seconds user\n       1.172459000 seconds sys\n```\n\n---\n\n## Kinda painful typing these events: Use `-d` (`--detailed`)\n\n```\n$ perf stat -d ./dot 100000000\n Performance counter stats for './dot 100000000':\n\n          1,945.17 msec task-clock:u              #    0.970 CPUs utilized\n                 0      context-switches:u        #    0.000 K/sec\n                 0      cpu-migrations:u          #    0.000 K/sec\n           390,463      page-faults:u             #    0.201 M/sec\n     3,329,516,701      cycles:u                  #    1.712 GHz                      (49.97%)\n     1,272,884,914      
instructions:u            #    0.38  insn per cycle           (62.50%)\n       150,445,759      branches:u                #   77.343 M/sec                    (62.55%)\n            14,766      branch-misses:u           #    0.01% of all branches          (62.53%)\n        76,672,490      L1-dcache-loads:u         #   39.417 M/sec                    (62.53%)\n        51,315,841      L1-dcache-load-misses:u   #   66.93% of all L1-dcache hits    (62.52%)\n         7,867,383      LLC-loads:u               #    4.045 M/sec                    (49.94%)\n         7,618,746      LLC-load-misses:u         #   96.84% of all LL-cache hits     (49.96%)\n\n       2.005801176 seconds time elapsed\n\n       0.982545000 seconds user\n       0.963534000 seconds sys\n```\n\n---\n\n## `perf stat -d` output on Andes\n\n```\n[nthompson@andes-login1]~/performance_tuning_tutorial% perf stat -d ./dot 1000000000\na.b = 1e+09\n\n Performance counter stats for './dot 1000000000':\n\n          2,242.43 msec task-clock:u              #    0.999 CPUs utilized\n                 0      context-switches:u        #    0.000 K/sec\n                 0      cpu-migrations:u          #    0.000 K/sec\n             8,456      page-faults:u             #    0.004 M/sec\n     2,972,264,893      cycles:u                  #    1.325 GHz                      (29.99%)\n         1,366,982      stalled-cycles-frontend:u #    0.05% frontend cycles idle     (30.02%)\n       747,429,126      stalled-cycles-backend:u  #   25.15% backend cycles idle      (30.07%)\n     3,499,896,128      instructions:u            #    1.18  insn per cycle\n                                                  #    0.21  stalled cycles per insn  (30.06%)\n       749,888,957      branches:u                #  334.410 M/sec                    (30.02%)\n             9,206      branch-misses:u           #    0.00% of all branches          (29.98%)\n     1,108,395,106      L1-dcache-loads:u         #  494.284 M/sec                    (29.97%)\n  
      36,998,921      L1-dcache-load-misses:u   #    3.34% of all L1-dcache accesses  (29.97%)\n                 0      LLC-loads:u               #    0.000 K/sec                    (29.97%)\n                 0      LLC-load-misses:u         #    0.00% of all LL-cache accesses  (29.97%)\n\n       2.244079417 seconds time elapsed\n\n       1.000742000 seconds user\n       1.214037000 seconds sys\n```\n\n^ Lots of backend cycles stalled in this one. This could be from high latency operations like divisions or from slow memory accesses.\n\n---\n\n## `perf stat` on a different type of computation\n\n```\n[nthompson@andes-login1]~/performance_tuning_tutorial% perf stat -d git archive --format=tar.gz --prefix=HEAD/ HEAD \u003e HEAD.tar.gz\n\n Performance counter stats for 'git archive --format=tar.gz --prefix=HEAD/ HEAD':\n\n             99.81 msec task-clock:u              #    0.795 CPUs utilized\n                 0      context-switches:u        #    0.000 K/sec\n                 0      cpu-migrations:u          #    0.000 K/sec\n             1,408      page-faults:u             #    0.014 M/sec\n       276,165,489      cycles:u                  #    2.767 GHz                      (28.07%)\n        72,227,873      stalled-cycles-frontend:u #   26.15% frontend cycles idle     (28.36%)\n        60,614,109      stalled-cycles-backend:u  #   21.95% backend cycles idle      (29.37%)\n       394,352,577      instructions:u            #    1.43  insn per cycle\n                                                  #    0.18  stalled cycles per insn  (30.58%)\n        66,882,750      branches:u                #  670.113 M/sec                    (31.95%)\n         2,974,856      branch-misses:u           #    4.45% of all branches          (32.23%)\n       183,326,327      L1-dcache-loads:u         # 1836.788 M/sec                    (31.19%)\n            49,245      L1-dcache-load-misses:u   #    0.03% of all L1-dcache accesses  (30.05%)\n                 0      LLC-loads:u       
        #    0.000 K/sec                    (29.37%)\n                 0      LLC-load-misses:u         #    0.00% of all LL-cache accesses  (28.84%)\n\n       0.125489614 seconds time elapsed\n\n       0.092736000 seconds user\n       0.006905000 seconds sys\n```\n\n^ Compression has much higher instruction complexity than a dot product, and we see that reflected here in the stalled frontend cycles. We also have a much higher branch miss rate.\n\n---\n\n## `perf stat` is great for reporting . . .\n\nBut not super actionable.\n\n---\n\n## Get Actionable Data\n\n```\n$ perf record -g ./dot 100000000\na.b = 9.99999e+07\n[ perf record: Woken up 3 times to write data ]\n[ perf record: Captured and wrote 0.735 MB perf.data (5894 samples) ]\n$ perf report -g -M intel\n```\n\n![inline](figures/perf_report_homescreen.png)\n\n---\n\n## Wait, what's actionable about this?\n\nSee how half the time is spent in the `std::vector` allocator?\n\nThat be a clue.\n\n---\n\n## Self and Children\n\n- The `Self` column says how much time was taken within the function itself.\n- The `Children` column says how much time was spent in the function plus everything it calls.\n\n- If the `Children` value is much larger than the `Self` value, the time is going to callees, and that function isn't itself your hotspot!\n\n\n---\n\nIf `Self` and `Children` are confusing, just drop the `Children` column:\n\n```bash\n$ perf report -g -M intel --no-children\n```\n\n---\n\n## More intelligible `perf report`\n\n```bash\n$ perf report --no-children -s dso,sym,srcline\n```\n\nBest to put this in a `perf config`:\n\n```\n$ perf config --user report.children=false\n$ cat ~/.perfconfig\n[report]\n\tchildren = false\n```\n\n---\n\n## Some other nice config options\n\n```\n$ perf config --user annotate.disassembler_style=intel\n$ perf config --user report.percent-limit=0.1\n```\n\n---\n\n## Disassembly\n\n![inline](figures/dot_disassembly.png)\n\n---\n\n## What is happening?????\n\n- If you don't know x86 assembly, I recommend Ray Seyfarth's [Introduction to 64 Bit 
Assembly Language Programming for Linux and OS X](http://rayseyfarth.com/asm/)\n\n- If you need to look up instructions one at a time, Felix Cloutier's [x64 reference](https://www.felixcloutier.com/x86/) is a great resource.\n\n- If you need to examine how compiler flags interact with generated assembly, try [godbolt](https://godbolt.org).\n\n---\n\n## Detour: System V ABI (Linux)\n\n- Floating point arguments are passed in registers `xmm0-xmm7`.\n- Integer parameters are passed in registers `rdi`, `rsi`, `rdx`, `rcx`, `r8`, and `r9`, in that order.\n- A floating point return value is placed in register `xmm0`.\n- Integer return values are placed in `rax`.\n\nKnowing this makes your godbolts a bit easier to read!\n\n---\n\n## The default assembly generated by gcc is braindead\n\nSee the [godbolt](https://godbolt.org/z/8qqhGj).\n\n- Superfluous stack writes.\n- No AVX instructions, no fused-multiply adds\n\nConsequence: Lots of time spent moving data around.\n\n---\n\n## Sidetrack: Fused-multiply add\n\nWe'll use the fused-multiply add instruction as a \"canonical\" example of an instruction which we *want* generated, but due to history, chaos, and dysfunction, generally *isn't*.\n\n---\n\n## Sidetrack: Fused-multiply add\n\nThe fma is defined as\n\n$$\n\\mathrm{fma}(a,b,c) := \\mathrm{rnd}(a*b + c)\n$$\n\ni.e., the multiplication and addition are performed in a single instruction, with a single rounding.\n\n^ I recently determined `gcc` wasn't generating fma's in our flagship product VTK-m. It's often said that it's meaningless to talk about performance of code compiled without optimizations. Implicit in this statement is another: It's incredibly difficult to convince the compiler to generate optimized assembly! 
(The Intel compiler is very good in this regard.)\n\n---\n\n## My preferred CPPFLAGS:\n\n```\n-g -O3 -ffast-math -fno-finite-math-only -march=native -fno-omit-frame-pointer\n```\n\nHow does that look on [godbolt](https://godbolt.org/z/4dnfYb)?\n\nKey instruction: `vfmadd132pd`; vectorized fused multiply add on `xmm`/`ymm` registers.\n\n---\n\n## Recompile with good flags\n\n```\n$ make\n$ perf stat ./dot 100000000     \na.b = 9.99999e+07\n\n Performance counter stats for './dot 100000000':\n\n          2,428.06 msec task-clock:u              #    0.998 CPUs utilized          \n                 0      context-switches:u        #    0.000 K/sec                  \n                 0      cpu-migrations:u          #    0.000 K/sec                  \n           390,994      page-faults:u             #    0.161 M/sec                  \n     3,651,637,732      cycles:u                  #    1.504 GHz                    \n     1,676,766,309      instructions:u            #    0.46  insn per cycle         \n       225,636,250      branches:u                #   92.929 M/sec                  \n             9,303      branch-misses:u           #    0.00% of all branches        \n\n       2.432163719 seconds time elapsed\n```\n\n1/3rd of the instructions/cycle, yet twice as fast, because it ran ~1/5th the number of instructions.\n\n\n---\n\n# Exercise\n\nLook at the code of `src/mwe.cpp`. Is it really measuring a dot product? Look at it under `perf report`.\n\nFix it if not.\n\n^ The performance of `src/mwe.cpp` is dominated by the cost of initializing data.\n^ The data initialization converts integers to floats and does divisions. Removing these increases the performance.\n^ Even once this is done, 40% of the time is spent in data allocation. 
This indicates a need for a more sophisticated approach.\n\n---\n\n## Register width\n\n- The 8008 architecture from 1972 had 8 bit registers, now vaguely resembling our current `al` register.\n\n- 16 bit registers were added with the 8086 in 1978; now labelled `ax`.\n\n- 32 bit registers arrived with the 80386 architecture in 1985; these are now prefixed with `e`, such as `eax`, `ebx`, and so on.\n\n- 64 bit registers were added in 2003 for the `x86_64` architecture. They are prefixed with `r`, such as the `rax` and `rbx` registers.\n\n---\n\n## Register width\n\nCompilers utilize the full width of integer registers without much fuss. The situation for floating point registers is much worse.\n\n\n---\n\n## Floating point register width\n\nAn `xmm` register is 128 bits wide, and can hold 2 doubles, or 4 floats.\n\nAVX introduced the `ymm` registers, which are 256 bits wide, and can hold 4 doubles, or 8 floats; AVX2 extended them to integer operations.\n\nAVX-512 (2016) introduced the `zmm` registers, which can hold 8 doubles or 16 floats.\n\n---\n\nTo determine if your CPU has `ymm` registers, check for avx2 instruction support:\n\n```bash\n$ lscpu | grep avx2\nFlags:\nfpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush\ndts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon\npebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64\nmonitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt\ntsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi\nflexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts\n```\n\nor (on CentOS)\n\n```bash\n$ cat /proc/cpuinfo | grep avx2\n```\n\n---\n\n## Mind bogglement\n\nI couldn't get `gcc` or `clang` to generate AVX-512 instructions, so I went looking for the story . . 
.\n\n---\n\n## Mind bogglement\n\n\u003e I hope AVX512 dies a painful death, and that Intel starts fixing real problems instead of trying to create magic instructions to then create benchmarks that they can look good on. I hope Intel gets back to basics: gets their process working again, and concentrate more on regular code that isn't HPC or some other pointless special case.\n\n-- [Linus Torvalds](https://www.realworldtech.com/forum/?threadid=193189\u0026curpostid=193190)\n\n---\n\n## Vector instructions\n\nEven if the CS people don't like AVX-512, it is still difficult to find the magical incantations required to generate AVX2 instructions.\n\nIt generally requires an `-march=native` compiler flag.\n\n---\n\n## Beautiful assembly:\n\n![inline](figures/ymm_dot_product.png)\n\n---\n\n## Exercise\n\nOn *Andes*, what causes this error?\n\n```\n$ module load intel/19.0.3\n$ icc -march=skylake-avx512 src/mwe.cpp\n$ ./a.out 1000000\nzsh: illegal hardware instruction (core dumped)\n```\n\n---\n\nCompiler defaults are for *compatibility*, not for performance!\n\n---\n\n## `perf report` commands\n\n- `k`: Show line numbers of source code\n- `o`: Show instruction number\n- `t`: Switch between percentage and samples\n- `J`: Number of jump sources on target; number of places that can jump here.\n- `s`: Hide/Show source code\n- `h`: Show options\n\n---\n\n## perf gotchas\n\n- perf sometimes attributes the time in a single instruction to the *next* instruction.\n\n---\n\n## perf gotchas\n\n```\n     │         if (absx \u003c 1)\n7.76 │       ucomis xmm1,QWORD PTR [rbp-0x20]\n0.95 │     ↓ jbe    a6\n1.82 │       movsd  xmm0,QWORD PTR ds:0x46a198\n0.01 │       movsd  xmm1,QWORD PTR ds:0x46a1a0\n0.01 |       movsd  xmm2,QWORD PTR ds:0x46a100\n```\n\nHmm, so moving data into `xmm1` and `xmm2` is 182x faster than moving data into `xmm0` . . .\n\nLooks like a misattribution of the `jbe`.\n\n---\n\n\u003e . .  
if you're trying to capture the IP on some PMC event, and there's a delay between the PMC overflow and capturing the IP, then the IP will point to the wrong address. This is skew. Another contributing problem is that micro-ops are processed in parallel and out-of-order, while the instruction pointer points to the resumption instruction, not the instruction that caused the event.\n\n--[Brendan Gregg](http://www.brendangregg.com/perf.html)\n\n---\n\n## Two sensible goals\n\nReduce power consumption, and/or reduce runtime.\n\nNot necessarily the same thing. Benchmark power consumption:\n\n```\n$ perf list | grep energy\n  power/energy-cores/                                [Kernel PMU event]\n  power/energy-pkg/                                  [Kernel PMU event]\n  power/energy-ram/                                  [Kernel PMU event]\n$ perf stat -e energy-cores ./dot 100000000\nPerformance counter stats for 'system wide':\n\n              8.55 Joules power/energy-cores/\n```\n\n---\n\n## Improving reproducibility\n\nFor small optimizations (\u003c 2% gains), our perf data often gets swamped in noise.\n\n```\n$ perf stat -e uops_retired.all,instructions,cycles -r 5 ./dot 100000000\n Performance counter stats for './dot 100000000' (5 runs):\n\n     1,817,358,542      uops_retired.all:u                                            ( +-  0.00% )\n     1,276,765,688      instructions:u            #    0.45  insn per cycle           ( +-  0.00% )\n     2,823,559,592      cycles:u                                                      ( +-  0.11% )\n\n            2.1110 +- 0.0422 seconds time elapsed  ( +-  2.00% )\n```\n\n---\n\n# Improving reproducibility\n\nSmall optimizations are really important, but really hard to measure reliably.\n\nSee [Producing wrong data without doing anything obviously wrong!](https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf)\n\nLink order, environment variables, [running in a new 
directory](https://youtu.be/koTf7u0v41o?t=1318), and the cache set of hot instructions can all have a huge impact on performance!\n\n---\n\n## Improving reproducibility\n\nInstruction count and uops are reproducible, but time and cycles are not.\n\nUse instruction count and uops retired as imperfect metrics for small optimizations when variance in runtime will swamp improvements.\n\n---\n\n## Long tail `perf`\n\nAttaching to a running process or MPI rank\n\n```bash\n$ top # find rogue process\n$ perf stat -d -p `pidof paraview`\n^C\n```\n\n---\n\n## Long tail `perf`\n\nSometimes, `perf` will gather *way* too much data, creating a huge `perf.data` file.\n\nSolved by reducing sampling frequency:\n\n```bash\n$ perf record -F 10 ./dot 100000000\n```\n\nor compressing (requires compilation with `zstd` support):\n\n```bash\n$ perf record -z ./dot 100000\n```\n\n---\n\n## Exercise\n\nReplace the computation of $$\\mathbf{a}\\cdot \\mathbf{b}$$ with the computation of $$\\left\\|\\mathbf{a}\\right\\|^2$$.\n\nThis halves the number of memory references/flop.\n\nIs it observable under `perf stat`?\n\n^ I see a meaningful reduction in L1 cache miss rate.\n\n---\n\n\n# Exercise\n\nParallelize the dot product using a framework of your choice.\n\nHow does it look under `perf`?\n\n\n---\n\n# Solution: OpenMP\n\n```cpp\ndouble dot_product(double* a, double* b, size_t n) {\n    double d = 0;\n    #pragma omp parallel for reduction(+:d)\n    for (size_t i = 0; i \u003c n; ++i) {\n        d += a[i]*b[i];\n    }\n    return d;\n}\n```\n\n---\n\n# Solution: C++17\n\n\n```cpp\ndouble d = std::transform_reduce(std::execution::par_unseq,\n                                 a.begin(), a.end(), b.begin(), 0.0);\n```\n\n(FYI: I had to do a [source build](https://github.com/oneapi-src/oneTBB/) of TBB to get this to work.)\n\n---\n\n## Parallel `perf` lessons\n\n`perf` is great at finding *hotspots*, not so great at finding coldspots.\n\n[hotspot](https://github.com/KDAB/hotspot), discussed later, will 
overcome this problem.\n\n---\n\nBreak?\n\n---\n\n## Session 2 Goals\n\n- Learn about google/benchmark\n- Profile entire workflows and generate flamegraphs and timecharts\n\n---\n\n## Challenges we need to overcome\n\n- Our MWE spent fully half its time initializing data. That's not very interesting.\n- We could only specify one vector length at a time. What if we'd written a performance bug that induced quadratic scaling?\n\n---\n\n## A [google/benchmark](https://github.com/google/benchmark/) [example](https://github.com/boostorg/math/blob/develop/reporting/performance/chebyshev_clenshaw.cpp):\n\n```bash\n$ ./reporting/performance/chebyshev_clenshaw.x --benchmark_filter=^ChebyshevClenshaw\n2020-10-16T15:36:34-04:00\nRunning ./reporting/performance/chebyshev_clenshaw.x\nRun on (16 X 2300 MHz CPU s)\nCPU Caches:\n  L1 Data 32 KiB (x8)\n  L1 Instruction 32 KiB (x8)\n  L2 Unified 256 KiB (x8)\n  L3 Unified 16384 KiB (x1)\nLoad Average: 2.49, 2.29, 2.09\n----------------------------------------------------------------------------\nBenchmark                                  Time             CPU   Iterations\n----------------------------------------------------------------------------\nChebyshevClenshaw\u003cdouble\u003e/2            0.966 ns        0.965 ns    637018028\nChebyshevClenshaw\u003cdouble\u003e/4             1.69 ns         1.69 ns    413440355\nChebyshevClenshaw\u003cdouble\u003e/8             4.26 ns         4.25 ns    161924589\nChebyshevClenshaw\u003cdouble\u003e/16            13.3 ns         13.3 ns     52107759\nChebyshevClenshaw\u003cdouble\u003e/32            39.4 ns         39.4 ns     17071255\nChebyshevClenshaw\u003cdouble\u003e/64             108 ns          108 ns      6438439\nChebyshevClenshaw\u003cdouble\u003e/128            246 ns          245 ns      2852707\nChebyshevClenshaw\u003cdouble\u003e/256            522 ns          521 ns      1316359\nChebyshevClenshaw\u003cdouble\u003e/512           1100 ns         1100 ns       
640076\nChebyshevClenshaw\u003cdouble\u003e/1024          2180 ns         2179 ns       311353\nChebyshevClenshaw\u003cdouble\u003e/2048          4499 ns         4496 ns       152754\nChebyshevClenshaw\u003cdouble\u003e/4096          9086 ns         9081 ns        79369\nChebyshevClenshaw\u003cdouble\u003e_BigO          2.27 N          2.26 N\nChebyshevClenshaw\u003cdouble\u003e_RMS              4 %             4 %\n```\n\n---\n\n## Goals for google/benchmark\n\n- Empirically determine asymptotic complexity; is it $$\\mathcal{O}(N)$$, $$\\mathcal{O}(N^2)$$, or $$\\mathcal{O}(\\log(N))$$?\n- Test inputs of different lengths\n- Test different types (`float`, `double`, `long double`)\n- Dominate the runtime with interesting and relevant operations so our `perf` traces are more meaningful.\n\n---\n\n## Installation\n\n- Grab a [release tarball](https://github.com/google/benchmark/releases)\n- `pip install google-benchmark`\n- `brew install google-benchmark`\n- `spack install benchmark`\n\n---\n\n## Installation\n\nSource build\n\n```bash\n$ git clone https://github.com/google/benchmark.git\n$ cd benchmark \u0026\u0026 mkdir build \u0026\u0026 cd build\nbuild$ cmake -DCMAKE_BUILD_TYPE=Release -DBENCHMARK_ENABLE_TESTING=OFF ../ -G Ninja\nbuild$ ninja\nbuild$ sudo ninja install\n```\n\n---\n\n## Example: `benchmarks/bench.cpp`\n\n```cpp\n#include \u003cvector\u003e\n#include \u003crandom\u003e\n#include \u003cbenchmark/benchmark.h\u003e\n\ntemplate\u003cclass Real\u003e\nvoid DotProduct(benchmark::State\u0026 state) {\n    std::vector\u003cReal\u003e a(state.range(0));\n    std::vector\u003cReal\u003e b(state.range(0));\n    std::random_device rd;\n    std::uniform_real_distribution\u003cReal\u003e unif(-1,1);\n    for (size_t i = 0; i \u003c a.size(); ++i) {\n        a[i] = unif(rd);\n        b[i] = unif(rd);\n    }\n\n    for (auto _ : state) {\n        benchmark::DoNotOptimize(dot_product(a.data(), b.data(), a.size()));\n    }\n    
state.SetComplexityN(state.range(0));\n}\n\nBENCHMARK_TEMPLATE(DotProduct, float)-\u003eRangeMultiplier(2)-\u003eRange(1\u003c\u003c3, 1\u003c\u003c18)-\u003eComplexity();\nBENCHMARK_TEMPLATE(DotProduct, double)-\u003eDenseRange(8, 1024*1024, 512)-\u003eComplexity();\nBENCHMARK_TEMPLATE(DotProduct, long double)-\u003eRangeMultiplier(2)-\u003eRange(1\u003c\u003c3, 1\u003c\u003c18)-\u003eComplexity(benchmark::oN);\n\nBENCHMARK_MAIN();\n```\n\n---\n\nInstantiate a benchmark on type float:\n\n```cpp\nBENCHMARK_TEMPLATE(DotProduct, float);\n```\n\nTest on vectors of length 8, 16, 32,.., 262144:\n\n```\n-\u003eRangeMultiplier(2)-\u003eRange(1\u003c\u003c3, 1\u003c\u003c18)\n```\n\nRegress the performance data against $$\\mathcal{O}(\\log(n)), \\mathcal{O}(n), \\mathcal{O}(n^2), \\mathcal{O}(n^3)$$:\n\n```\n-\u003eComplexity();\n```\n\n---\n\nForce regression against $$\\mathcal{O}(n)$$:\n\n```\n-\u003eComplexity(benchmark::oN);\n```\n\nRepeat the calculation until confidence in the runtime is obtained:\n\n```cpp\nfor (auto _ : state) { ... 
}\n```\n\nMake sure the compiler doesn't elide these instructions:\n\n```cpp\nbenchmark::DoNotOptimize(dot_product(a.data(), b.data(), a.size()));\n```\n\n---\n\n## google/benchmark party tricks: Visualize complexity\n\nSet a counter to the length of the vector:\n\n```\nstate.counters[\"n\"] = state.range(0);\n```\n\nThen get the output as CSV:\n\n```\nbenchmarks$ ./dot_bench --benchmark_format=csv\n```\n\nFinally, copy-paste the console output into [scatterplot.online](https://scatterplot.online/)\n\n---\n\n![inline](figures/benchmark_linear_complexity.png)\n\n---\n\n## `SetBytesProcessed`\n\nWe can attack the memory-bound vs CPU bound problem via `SetBytesProcessed`.\n\n```\n ./dot_bench --benchmark_filter=DotProduct\\\u003cdouble\n2020-10-18T12:33:41-04:00\nRunning ./dot_bench\nRun on (16 X 4300 MHz CPU s)\nCPU Caches:\n  L1 Data 32 KiB (x8)\n  L1 Instruction 32 KiB (x8)\n  L2 Unified 1024 KiB (x8)\n  L3 Unified 11264 KiB (x1)\nLoad Average: 0.63, 0.54, 0.72\n-------------------------------------------------------------------------------------\nBenchmark                           Time             CPU   Iterations UserCounters...\n-------------------------------------------------------------------------------------\nDotProduct\u003cdouble\u003e/64           0.004 us        0.004 us    155850953 bytes_per_second=212.277G/s n=64\nDotProduct\u003cdouble\u003e/128          0.010 us        0.010 us     73113102 bytes_per_second=200.232G/s n=128\nDotProduct\u003cdouble\u003e/256          0.015 us        0.015 us     45589300 bytes_per_second=247.706G/s n=256\nDotProduct\u003cdouble\u003e/512          0.029 us        0.029 us     24430471 bytes_per_second=266.21G/s n=512\nDotProduct\u003cdouble\u003e/1024         0.056 us        0.056 us     12490510 bytes_per_second=273.686G/s n=1024\nDotProduct\u003cdouble\u003e/2048         0.158 us        0.158 us      4413687 bytes_per_second=193.436G/s n=2.048k\nDotProduct\u003cdouble\u003e/4096         0.676 us        0.676 us    
  1035341 bytes_per_second=90.2688G/s n=4.096k\nDotProduct\u003cdouble\u003e/8192          1.33 us         1.33 us       520428 bytes_per_second=91.5784G/s n=8.192k\nDotProduct\u003cdouble\u003e/16384         2.71 us         2.71 us       258728 bytes_per_second=89.9407G/s n=16.384k\nDotProduct\u003cdouble\u003e/32768         5.51 us         5.51 us       127636 bytes_per_second=88.5911G/s n=32.768k\nDotProduct\u003cdouble\u003e/65536         19.9 us         19.9 us        35225 bytes_per_second=49.1777G/s n=65.536k\nDotProduct\u003cdouble\u003e/131072        77.7 us         77.7 us         9013 bytes_per_second=25.141G/s n=131.072k\nDotProduct\u003cdouble\u003e/262144         157 us          157 us         4458 bytes_per_second=24.8915G/s n=262.144k\nDotProduct\u003cdouble\u003e/524288         330 us          330 us         2129 bytes_per_second=23.6636G/s n=524.288k\nDotProduct\u003cdouble\u003e/1048576        812 us          812 us          835 bytes_per_second=19.2495G/s n=1048.58k\n```\n\n---\n\n## Is this good or not?\n\n```bash\n$ sudo lshw -class memory\n  *-memory:0\n       description: System Memory\n       physical id: 3d\n       slot: System board or motherboard\n     *-bank:0\n          description: DIMM DDR4 Synchronous 2666 MHz (0.4 ns)\n          physical id: 0\n          serial: #@\n          slot: CPU1_DIMM_A0\n          size: 8GiB\n          width: 64 bits\n          clock: 2666MHz (0.4ns)\n```\n\nSo our RAM can transfer 8 bytes at 2.666 GHz: about 21.3 GB/second.\n\n---\n\n## Exercise\n\nDetermine the size of the lowest level cache on your machine.\n\nCan you empirically observe cache effects?\n\nHint: Use the `DenseRange` option.\n\n---\n\n## Long tail `google/benchmark`\n\nIf you have root, you can decrease run-to-run variance via\n\n```\n$ sudo cpupower frequency-set --governor performance\n```\n\n---\n\n# perf + google/benchmark\n\n```\n$ perf record -g ./dot_bench --benchmark_filter=DotProduct\\\u003cdouble\n$ perf annotate\nPercent│       cmp       
    rdi,0x2                                                                                         ▒\n       │     ↓ jbe           795                                                                                             ▒\n  0.30 │       xor           eax,eax                                                                                         ▒\n  0.04 │       vxorpd        xmm0,xmm0,xmm0                                                                                  ▒\n       │       nop                                                                                                           ◆\n       │     d += a[i]*b[i];                                                                                                 ▒\n 25.30 │2e0:┌─→vmovupd       ymm2,YMMWORD PTR [r13+rax*1+0x0]                                                                ▒\n 65.58 │    │  vfmadd231pd   ymm0,ymm2,YMMWORD PTR [r12+rax*1]                                                               ▒\n       │    │for (unsigned long long i = 0; i \u003c n; ++i) {                                                                    ▒\n  1.94 │    │  add           rax,0x20                                                                                        ▒\n       │    ├──cmp           rdx,rax                                                                                         ▒\n  1.99 │    └──jne           2e0                                                                                             \n```\n\nNow most of our time is spent in the interesting part of our code.\n\n---\n\n# perf + google/benchmark gotchas\n\nIn contrast to our previous examples, the instruction and uop counts are *not* stable.\n\n\u003e The number of iterations to run is determined dynamically by running the benchmark a few times and measuring the time taken and ensuring that the ultimate result will be statistically stable.\n--[Google benchmark 
docs](https://github.com/google/benchmark)\n\n---\n\n## Exercise\n\nProfile a squared norm using google/benchmark.\n\nCompute it in both `float` and `double` precision, determine the asymptotic complexity, and measure the number of bytes/second you are able to process.\n\n---\n\n## Exercise\n\nCompare interpolation search to binary search using `perf` and google/benchmark.\n\n---\n\n## Break?\n\n---\n\n## Session 3\n\nFlamegraphs\n\n---\n\n# _What is google/benchmark not good for?_\n\nProfiling workflows. It's a *microbenchmark* library.\n\nBut huge problems can often arise when integrating even well-designed and performant functions.\n\nWhat to do?\n\n---\n\n## [Flamegraph](https://gitlab.kitware.com/vtk/vtk-m/-/issues/499) of VTK-m graphics pipeline\n\n\n![inline](figures/read_portal.svg)\n\n---\n\nFlamegraphs present *sorted* unique stack frames, with width drawn proportional to (samples in that frame)/(total samples).\n\nSorting the stack frames means the x-axis is not a time axis! Great for multithreaded code. The x-axis is sorted alphabetically.\n\nThe y-axis is the call stack.\n\nSee the [paper](https://queue.acm.org/detail.cfm?id=2927301).\n\n---\n\n## Flamegraphs\n\n```\n$ git clone https://github.com/brendangregg/FlameGraph.git\n```\n\n---\n\n## Flamegraph MWE\n\nIn a directory with a `perf.data` file, run\n\n```\n$ perf script | ~/FlameGraph/stackcollapse-perf.pl| ~/FlameGraph/flamegraph.pl \u003e flame.svg\n$ firefox flame.svg\n```\n\nI find this hard to remember, so I have an alias:\n\n```\n$ alias | grep flame\nflamegraph='perf script | ~/FlameGraph/stackcollapse-perf.pl| ~/FlameGraph/flamegraph.pl \u003e flame.svg'\n```\n\n---\n\n## Viewing flamegraphs\n\nFirefox is best, but there is no Firefox on Andes. 
Try ImageMagick:\n\n```\n$ ssh -X `whoami`@andes.olcf.ornl.gov\n$ module load imagemagick/7.0.8-7-py3\n$ magick display flame.svg\n```\n\n\n\n---\n\n## Flamegraph example: VTK-m Volume Rendering\n\n```\n$ git clone https://gitlab.kitware.com/vtk/vtk-m.git\n$ cd vtk-m \u0026\u0026 mkdir build \u0026\u0026 cd build\n$ cmake ../ \\\n   -DCMAKE_CXX_FLAGS=\"${CMAKE_CXX_FLAGS} -march=native -fno-omit-frame-pointer -Wfatal-errors -ffast-math -fno-finite-math-only -O3 -g\" \\\n   -DVTKm_ENABLE_EXAMPLES=ON -DVTKm_ENABLE_OPENMP=ON -DVTKm_ENABLE_TESTING=OFF -G Ninja  \n$ ninja\n$ perf stat -d ./examples/demo/Demo\n$ perf record -g ./examples/demo/Demo\n```\n\nNote: If you have a huge program you'd like to profile, compile it now and follow along!\n\n---\n\n# Step by step: `perf script`\n\nDumps all recorded stack traces\n\n```\n$ perf script\nperf 20820 510465.112358:          1 cycles:\n        ffffffff9f277a8a native_write_msr+0xa ([kernel.kallsyms])\n        ffffffff9f20d7ed __intel_pmu_enable_all.constprop.31+0x4d ([kernel.kallsyms])\n        ffffffff9f20dc29 intel_tfa_pmu_enable_all+0x39 ([kernel.kallsyms])\n        ffffffff9f207aec x86_pmu_enable+0x11c ([kernel.kallsyms])\n        ffffffff9f40ac26 ctx_resched+0x96 ([kernel.kallsyms])\n        ffffffff9f415562 perf_event_exec+0x182 ([kernel.kallsyms])\n        ffffffff9f4e65e2 setup_new_exec+0xc2 ([kernel.kallsyms])\n        ffffffff9f55a9ff load_elf_binary+0x3af ([kernel.kallsyms])\n        ffffffff9f4e4441 search_binary_handler+0x91 ([kernel.kallsyms])\n        ffffffff9f4e5696 __do_execve_file.isra.39+0x6f6 ([kernel.kallsyms])\n        ffffffff9f4e5a49 __x64_sys_execve+0x39 ([kernel.kallsyms])\n        ffffffff9f204417 do_syscall_64+0x57 ([kernel.kallsyms])\n```\n\n---\n\n## Step by step: `stackcollapse-perf.pl`\n\n\nMerges duplicate stack samples, sorts them alphabetically:\n\n```\n$ perf script | ~/FlameGraph/stackcollapse-perf.pl \u003e out.folded\n$ cat out.folded | more\nDemo;[libgomp.so.1.0.0] 
1856\nDemo;[unknown];[libgomp.so.1.0.0];vtkm::cont::DeviceAdapterAlgorithm\u003cvtkm::cont::DeviceAdapterTagOpenMP\u003e::ScheduleTask 8\n```\n\n\n---\n\n## Generate a stack\n\n```\n$ perf script | ~/FlameGraph/stackcollapse-perf.pl \u003e out.folded\n$ ~/FlameGraph/flamegraph.pl out.folded --title=\"VTK-m rendering and isocontouring\" \u003e flame.svg\n```\n\n---\n\n## Convert sorted/unique stack frames into pretty picture\n\n\n```\n$ cat out.folded | ~/FlameGraph/flamegraph.pl\n\u003c?xml version=\"1.0\" standalone=\"no\"?\u003e\n\u003c!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\"\u003e\n\u003csvg version=\"1.1\" width=\"1200\" height=\"2134\" onload=\"init(evt)\" viewBox=\"0 0 1200 2134\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.\norg/1999/xlink\"\u003e\n\u003c!-- Flame graph stack visualization. See https://github.com/brendangregg/FlameGraph for latest version, and http://www.brendangregg.com/flamegraphs.ht\nml for examples. --\u003e\n\u003c!-- NOTES:  --\u003e\n\u003cdefs\u003e\n        \u003clinearGradient id=\"background\" y1=\"0\" y2=\"1\" x1=\"0\" x2=\"0\" \u003e\n                \u003cstop stop-color=\"#eeeeee\" offset=\"5%\" /\u003e\n                \u003cstop stop-color=\"#eeeeb0\" offset=\"95%\" /\u003e\n        \u003c/linearGradient\u003e\n\u003c/defs\u003e\n\u003cstyle type=\"text/css\"\u003e\n```\n\n---\n\n## Analyzing only one particular function\n\n```\n$ grep BVHTraverser out.folded | ~/FlameGraph/flamegraph.pl \u003e flame.svg\n```\n\n---\n\n## perf in other languages and contexts\n\nSee [Brendan Gregg's](https://youtu.be/tAY8PnfrS_k) YOW! 
keynote for Java performance analysis using this workflow.\n\n---\n\n## VTK-m Volume Rendering with OpenMP\n\n![inline](figures/vtkm_openmp.svg)\n\n---\n\n## VTK-m Volume Rendering (TBB)\n\n![inline](figures/vtkm_tbb_rendering.svg)\n\n---\n\n## VTK-m Volume Rendering with CUDA\n\n![inline](figures/vtkm_cuda.svg)\n\n---\n\n## `perf` GUI?\n\nYou can use [hotspot](https://github.com/KDAB/hotspot) if you like GUIs.\n\nhotspot also has a number of excellent alternative visualizations created from the `perf.data` file.\n\n---\n\n## Installing hotspot\n\nDownload the [AppImage](https://github.com/KDAB/hotspot/releases),\n\n```bash\n$ chmod a+x Hotspot-git.102d4b7-x86_64.AppImage\n$ ./Hotspot-git.102d4b7-x86_64.AppImage\n```\n\nNote: On Andes, use\n\n```bash\n$ ./Hotspot-git.102d4b7-x86_64.AppImage --appimage-extract-and-run\n```\n\n\n---\n\n## Wait-time analysis (i.e., off-CPU profiling)\n\nFlamegraphs show the expense of *executing instructions*.\n\nIn multithreaded environments, we often need to know the expense of *doing nothing*.\n\nGetting an idea of which cores are doing nothing is called \"off-CPU profiling\", or \"wait-time analysis\".\n\n---\n\n## Other Criticisms of Flamegraphs\n\nSee [here](https://stackoverflow.com/a/25870103) for numerous ways for problems to hide from Flamegraphs.\n\n---\n\n## Off-CPU profiling with hotspot\n\n![inline](figures/HotSpotOffCPU.png)\n\n---\n\n## Off-CPU profiling: `perf`\n\n```bash\n$ perf record --call-graph dwarf --event cycles --switch-events --event sched:sched_switch \\\n   --aio --sample-cpu ~/vtk-m/build/examples/demo/Demo\n```\n\n---\n\n## Off-CPU profiling: `hotspot`\n\n![inline](figures/hotspotoffcpuconfig.png)\n\n---\n\n## Discussion: Is this bad or good?\n\n---\n\n## My conclusion:\n\nAmdahl's law is very harsh--our thread utilization gets destroyed by the lz77 encoding of the PNG.\n\nNot exactly what you hope for when you are computing an 
isocontour.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNAThompson%2Fperformance_tuning_tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNAThompson%2Fperformance_tuning_tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNAThompson%2Fperformance_tuning_tutorial/lists"}