{"id":17336501,"url":"https://github.com/dr-noob/peakperf","last_synced_at":"2025-04-09T14:13:50.958Z","repository":{"id":111745074,"uuid":"124663596","full_name":"Dr-Noob/peakperf","owner":"Dr-Noob","description":"Achieve peak performance on x86 CPUs and NVIDIA GPUs","archived":false,"fork":false,"pushed_at":"2024-10-07T07:48:15.000Z","size":256,"stargazers_count":67,"open_issues_count":14,"forks_count":15,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-02T12:12:59.705Z","etag":null,"topics":["assembly","avx","cpu","cpu-frequency","cpu-microarchitecture","cuda","gflop","gpu","intrinsics","microarchitecture","microbenchmark","nvidia","performance"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Dr-Noob.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-10T14:30:12.000Z","updated_at":"2025-02-28T09:06:18.000Z","dependencies_parsed_at":"2024-01-14T09:30:22.393Z","dependency_job_id":"ce0d16eb-e9e0-40d8-bd86-b68ad8e1b410","html_url":"https://github.com/Dr-Noob/peakperf","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dr-Noob%2Fpeakperf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dr-Noob%2Fpeakperf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dr-Noob%2Fpeakperf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Dr-Noob%2Fpea
kperf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Dr-Noob","download_url":"https://codeload.github.com/Dr-Noob/peakperf/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248054193,"owners_count":21039952,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["assembly","avx","cpu","cpu-frequency","cpu-microarchitecture","cuda","gflop","gpu","intrinsics","microarchitecture","microbenchmark","nvidia","performance"],"created_at":"2024-10-15T15:30:48.011Z","updated_at":"2025-04-09T14:13:50.941Z","avatar_url":"https://github.com/Dr-Noob.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# peakperf\nMicrobenchmark to achieve peak performance on x86_64 CPUs and NVIDIA GPUs.\n\n**Table of Contents**\n\u003c!-- UPDATE with: doctoc --notitle README.md --\u003e\n\u003c!-- START doctoc generated TOC please keep comment here to allow auto update --\u003e\n\u003c!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --\u003e\n\n\n- [1. Support](#1-support)\n  - [1.1 Software support](#11-software-support)\n  - [1.2 Hardware support](#12-hardware-support)\n- [2. 
Instalation](#2-instalation)\n  - [2.1 Building from source](#21-building-from-source)\n  - [2.2 Enabling and disabling support for CPU/GPU](#22-enabling-and-disabling-support-for-cpugpu)\n    - [CUDA is installed but peakperf is unable to find it](#cuda-is-installed-but-peakperf-is-unable-to-find-it)\n    - [Manually disabling compilation for CPU/GPU](#manually-disabling-compilation-for-cpugpu)\n- [3. Usage:](#3-usage)\n  - [3.1 Selecting CPU or GPU](#31-selecting-cpu-or-gpu)\n  - [3.2. The environment](#32-the-environment)\n  - [3.3. Microarchitecture detection](#33-microarchitecture-detection)\n  - [3.4. Options](#34-options)\n- [4. Understanding the microbenchmark](#4-understanding-the-microbenchmark)\n  - [4.1 What is \"peak performance\" anyway?](#41-what-is-peak-performance-anyway)\n  - [4.2 The formula (CPU)](#42-the-formula-cpu)\n  - [4.3 The formula (GPU)](#43-the-formula-gpu)\n  - [4.4 About the frequency to use in the formula](#44-about-the-frequency-to-use-in-the-formula)\n  - [4.5 What can I do if I do not get the expected results?](#45-what-can-i-do-if-i-do-not-get-the-expected-results)\n- [5. Evaluation](#5-evaluation)\n  - [Intel](#intel)\n  - [AMD](#amd)\n  - [NVIDIA](#nvidia)\n- [6. Microarchitecture table](#6-microarchitecture-table)\n  - [6.1 CPU](#61-cpu)\n  - [6.2 GPU](#62-gpu)\n\n\u003c!-- END doctoc generated TOC please keep comment here to allow auto update --\u003e\n\n# 1. Support\n\n## 1.1 Software support\npeakperf only works properly in Linux. peakperf under Windows / macOS has not been tested, so performance may not be optimal. 
A Windows port may be implemented in the future (see [Issue #1](https://github.com/Dr-Noob/peakperf/issues/1)).\n\n## 1.2 Hardware support\nSupported microarchitectures are:\n\n- **CPU (x86_64)**:\n  - Intel: Sandy Bridge and newer.\n  - AMD: Zen and newer.\n- **GPU**:\n  - NVIDIA: Compute Capability \u003e= 2.0.\n\nFor a complete list of supported microarchitectures, see section [5](#5-evaluation).\n\nNOTES:\n- _Only GPUs whose frequency can be read in real time (using freq.sh) can actually be evaluated._\n- _Other microarchitectures not mentioned here may also work._\n\n# 2. Instalation\nThere is a peakperf package available in Arch Linux ([peakperf-git](https://aur.archlinux.org/packages/peakperf-git)).\n\nIf you are on another distro, you can build `peakperf` from source.\n\n## 2.1 Building from source\nBuild the microbenchmark with the build script, which uses `cmake`:\n\n```\ngit clone https://github.com/Dr-Noob/peakperf\ncd peakperf\n./build.sh\n./peakperf\n```\n\n## 2.2 Enabling and disabling support for CPU/GPU\nBy default, peakperf will be built with support for CPU and GPU. GPU support will only be enabled if CUDA is found. During the `cmake` execution, peakperf will print a summary where you can check which devices peakperf was compiled for.\n\n```\n-- ----------------------\n-- peakperf build report:\n-- CPU mode: ON\n-- GPU mode: ON\n-- ----------------------\n```\n\n### CUDA is installed but peakperf is unable to find it\nSometimes, `cmake` will fail to find CUDA even though it is installed. 
To let `cmake` find CUDA, edit the build.sh script and use:\n- `-DCMAKE_CUDA_COMPILER=/path/to/nvcc`\n- `-DCMAKE_CUDA_COMPILER_TOOLKIT_ROOT=/path/to/cuda`\n\n### Manually disabling compilation for CPU/GPU\nUse `cmake` variables:\n- `-DENABLE_CPU_DEVICE=[ON|OFF]`\n- `-DENABLE_GPU_DEVICE=[ON|OFF]`\n\nFor example, building with `-DENABLE_CPU_DEVICE=OFF` results in:\n\n```\n-- ----------------------\n-- peakperf build report:\n-- CPU mode: OFF\n-- GPU mode: ON\n-- ----------------------\n```\n\n# 3. Usage:\n\n## 3.1 Selecting CPU or GPU\nBy default, peakperf will run on the CPU:\n\n```\n[noob@drnoob peakperf]$ ./peakperf -t 4\n\n-----------------------------------------------------\n    peakperf (https://github.com/Dr-Noob/peakperf)\n-----------------------------------------------------\n        CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz\n  Microarch: Haswell\n  Benchmark: Haswell (AVX2)\n Iterations: 1.00e+09\n      GFLOP: 640.00\n    Threads: 4\n\n   Nº  Time(s)  GFLOP/s\n    1  1.25743   508.97 *\n    2  1.25137   511.44 *\n    3  1.25141   511.42\n    ...................\n   12  1.25136   511.44\n-----------------------------------------------------\n Average performance:      511.43 +- 0.01 GFLOP/s\n-----------------------------------------------------\n* - warm-up, not included in average\n```\n\nTo manually select the device, use `-d [cpu|gpu]`. 
To run peakperf on the GPU:\n\n```\n[noob@drnoob peakperf]$ ./peakperf -d gpu\n\n------------------------------------------------------\n    peakperf (https://github.com/Dr-Noob/peakperf)\n------------------------------------------------------\n           GPU: GeForce GTX 970\n     Microarch: Maxwell\n    Iterations: 4.00e+08\n         GFLOP: 7987.20\n        Blocks: 13\n Threads/block: 768\n\n   Nº  Time(s)  GFLOP/s\n    1  1.87078  4269.44 *\n    2  1.84159  4337.12 *\n    3  1.84205  4336.03\n    ...................\n   12  1.84194  4336.31\n------------------------------------------------------\n Average performance:      4336.69 +- 0.91 GFLOP/s\n------------------------------------------------------\n* - warm-up, not included in average\n```\n\n## 3.2. The environment\nTo achieve the best performance, you should run this test with the computer working under minimum load (e.g., in non-graphics mode). If you are on a desktop machine, a good way to do this is by issuing `systemctl isolate multi-user.target`.\n\n## 3.3. Microarchitecture detection\npeakperf automatically detects your CPU/GPU and runs the best benchmark for your architecture.\n\n1. For the CPU, you can list all available benchmarks in peakperf and select which one you want to run:\n\n```\n[noob@drnoob peakperf]$ ./peakperf -l\nAvailable benchmark types:\n...\n[noob@drnoob peakperf]$ ./peakperf -b haswell\n```\n\n2. For the GPU, only one benchmark exists, and how close it gets to peak performance depends on the kernel launch configuration. peakperf automatically determines this configuration for your GPU.\n\n## 3.4. Options\npeakperf has many different options to tweak and experiment with your hardware. Use `-h` to print all available options.\n\n_NOTE_: Some options are available only on CPU or GPU.\n\n# 4. Understanding the microbenchmark\n## 4.1 What is \"peak performance\" anyway?\nPeak performance refers to the maximum performance that a chip (a CPU) can achieve. 
The more powerful the CPU is, the higher its peak performance. This performance is a theoretical limit, computed using a formula (see next section) and measured in floating-point operations per second (FLOP/s, or GFLOP/s for billions of them). This value establishes a performance limit that the CPU cannot exceed. However, achieving the peak performance (the maximum performance for a given CPU) is a very hard (but also interesting) task. To do so, the software must take advantage of the full power of the CPU. peakperf is a microbenchmark that achieves peak performance on many different x86_64 microarchitectures.\n\n## 4.2 The formula (CPU)\n\n```\nN_CORES * FREQUENCY * FMA * UNITS * (SIZE_OF_VECTOR/32)\n```\n\n- N_CORES: The number of physical cores. In our example, it is **4**.\n- FREQUENCY: The frequency of the CPU measured in GHz. Measuring this frequency is a bit tricky; see section 4.4 for more details. In our example, it is **3.997** (section 4.4 shows where this value comes from).\n- FMA: If the CPU supports FMA, the peak performance is multiplied by 2. If not, it is multiplied by 1. In our example, it is **2**.\n- UNITS: CPUs can provide 1 or 2 functional units per core. Modern Intel CPUs usually provide 2, while AMD CPUs usually provide 1. In our example, it is **2**.\n- SIZE_OF_VECTOR: If the CPU supports AVX, the size is 256 (because AVX registers are 256 bits wide). If the CPU supports AVX512, the size is 512. Dividing by 32 gives the number of single-precision values per vector. In our example, the size is **256**.\n\nFor the example of an i7-4790K, with the frequency expressed in GHz, we have:\n\n```\n4 * 3.997 * 2 * 2 * (256/32) = 511.61 GFLOP/s\n```\n\nAnd, as you can see in the previous test, we got 511.43 GFLOP/s, which tells us that peakperf is working properly and our CPU is behaving exactly as we expected.\n\n## 4.3 The formula (GPU)\n\n```\nN_CORES * FREQUENCY * FMA\n```\n\nThe GPU formula is simpler. `N_CORES` in this case is simply the number of CUDA cores (in the case of NVIDIA GPUs). 
Modern GPUs usually support FMA.\n\n## 4.4 About the frequency to use in the formula\n\nWhile running this microbenchmark, your CPU will be executing AVX code, so the frequency of your CPU running this code is neither your base nor your turbo frequency. Please have a look at [this document](http://www.dolbeau.name/dolbeau/publications/peak.pdf) (section IV.B) for more information.\n\nThe AVX frequency for a specific CPU is sometimes available online. The most effective way I know to get this frequency is to actually measure your CPU frequency in real time while running AVX code. You can use the script [freq.sh](https://github.com/Dr-Noob/peakperf/freq.sh) to achieve this:\n1. Run the microbenchmark in the background (`./peakperf -r 4 -w 0 \u003e /dev/null \u0026`)\n2. Run the script (`./freq.sh`), which will fetch your CPU frequency in real time (use `./freq.sh gpu` for measuring the GPU). In my case, I get:\n\n```\nEvery 0,2s: grep 'MHz' /proc/cpuinfo\n\ncpu MHz         : 3997.629\ncpu MHz         : 3997.629\ncpu MHz         : 3997.630\ncpu MHz         : 3997.630\ncpu MHz         : 3997.630\ncpu MHz         : 3997.630\ncpu MHz         : 3997.629\ncpu MHz         : 3997.630\n```\n\nAs you can see, the i7-4790K's frequency while running AVX code is ~3997.630 MHz, which is 3.997 GHz. However, you may see that your frequency fluctuates too much to estimate the frequency of your CPU. This may happen because:\n1. The microbenchmark is not working correctly. Please create an [issue on GitHub](https://github.com/Dr-Noob/peakperf/issues).\n2. Your CPU is not able to keep a stable frequency. This often happens if it is too hot, so the CPU is forced to lower its frequency to avoid overheating.\n\n## 4.5 What can I do if I do not get the expected results?\nPlease create an [issue on GitHub](https://github.com/Dr-Noob/peakperf/issues), posting the output of peakperf.\n\n# 5. 
Evaluation\nThis table shows the performance of peakperf for each microarchitecture supported by the microbenchmark. **To see all the hardware tested, see [benchmarks](BENCHMARKS.md)**\n\n## Intel\n| uarch           | CPU                | AVX Clock    | PP (Formula) | PP (Experimental)  | Loss    |\n|:---------------:|:------------------:|:------------:|:------------:|:------------------:|:-------:|\n| Sandy Bridge    | i5-2400            | `3.192 GHz`  |  `102.14`    |  `100.64 +- 0.00`  | `1.46%` |\n| Ivy Bridge      | 2x Xeon E5-2650 v2 | `2.999 GHz`  |  `767.74`    |  `744.24 +- 3.85`  | `3.15%` |\n| Haswell         | i7-4790K           | `3.997 GHz`  |  `511.61`    |  `511.43 +- 0.01`  | `0.03%` |\n| Broadwell       | 2x Xeon E5-2698 v4 | `2.599 GHz`  | `3326.72`    | `3269.87 +- 14.42` | `1.73%` |\n| Skylake         | i5-6400            | `3.099 GHz`  |  `396.67`    |  `396.61 +- 0.01`  | `0.06%` |\n| Knights Landing | Xeon Phi 7250      | `1.499 GHz`  | `5991.69`    | `5390.84 +- 7.83`  | `3.72%` |\n| Kaby Lake       | i5-8250U           | `2.700 GHz`  |  `345.60`    |  `343.57 +- 1.38`  | `0.59%` |\n| Coffee Lake     | i9-9900K           | `3.600 GHz`  |  `921.60`    |  `918.72 +- 1.13`  | `0.31%` |\n| Comet Lake      | i9-10900KF         | `4.100 GHz`  |  `1312.00`   | `1308.24 +- 0.30`  | `0.30%` |\n| Cascade Lake    | 2x Xeon Gold 6238  | `2.099 GHz`  |  `5910.78`   | `5851.60 +- 2.69`  | `1.01%` |\n| Ice Lake        | i5-1035G1          | `2.990 GHz`  |  `382.72`    |  `382.22 +- 0.18`  | `0.13%` |\n| Tiger Lake      | -                  | -            |  -           |  -                 | -       |\n| Rocket Lake     | i7-11700           | `4.400 GHz`  |  `1126.4`    |  `1121.69 +- 0.60` | `0.41%` |\n| Alder Lake      | i9-12900K          | `4.900 GHz`  |  `1727.8`    |  `1709.28 +- 0.22` | `1.07%` |\n\n## AMD\n| uarch | CPU              | AVX Clock    | PP (Formula) | PP (Experimental)  | Loss    
|\n|:-----:|:----------------:|:------------:|:------------:|:------------------:|:-------:|\n| Zen   | -                | -            | -            | -                  | -       |\n| Zen+  | AMD Ryzen 5 2600 | `3.724 GHz`  | `357.50`     | `357.08 +- 0.03`   | `0.11%` |\n| Zen 2 | -                | -            | -            | -                  | -       |\n| Zen 3 | 2x AMD EPYC 7413 | `3.000 GHz`  | `4608.00`    | `4551.55 +- 21.45` | `1.24%` |\n\n## NVIDIA\n| C.C | uarch        | GPU         | Clock        | PP (Formula) | PP (Experimental)   | Loss    |\n|:---:|:------------:|:-----------:|:------------:|:------------:|:-------------------:|:-------:|\n| 5.2 | Maxwell      | GTX 970     | `1.341 GHz`  | `4462.84`    | `4333.92 +- 0.90`   | `2.97%` |\n| 6.1 | Pascal       | GTX 1080    | `1.860 GHz`  | `9523.20`    | `9397.97 +- 0.10`   | `1.33%` |\n| 7.5 | Turing       | RTX 2080 Ti | `1.905 GHz`  | `16581.12`   | `16373.28 +- 16.07` | `1.26%` |\n| 8.6 | Ampere       | -           | -            | -            | -                   | -       |\n| 8.9 | Ada Lovelace | -           | -            | -            | -                   | -       |\n\n_NOTE 1_: Performance is measured in single precision and reported in GFLOP/s (billions of floating-point operations per second).\n\n_NOTE 2_: The clock information is retrieved _experimentally_. In other words, these are not the theoretical values for each device, but the actual frequency measured on each device (using the `freq.sh` script).\n\n_NOTE 3_: KNL performance is computed as PP * (6/7) (see [explanation](https://sites.utexas.edu/jdm4372/2018/01/22/a-peculiar-throughput-limitation-on-intels-xeon-phi-x200-knights-landing/)).\n\n_NOTE 4_: Sandy Bridge and Ivy Bridge have ADD and MUL VPUs that can be used in parallel. Therefore, the Xeon E5-2650 v2 formula is computed as `FREQ * CORES * 2 * 2 * 8`. However, the i5-2400 peak performance is computed as half of that. The explanation is that the ADD and MUL VPUs can only both be kept busy if the CPU supports hyperthreading. 
If the CPU does not support hyperthreading, a single core is unable to fill both VPUs fast enough.\n\n# 6. Microarchitecture table\n\nThe following tables act as a summary of all supported microarchitectures with their characteristics.\n\n## 6.1 CPU\n| uarch           | FMA              | AVX512             | Slots | FPUs            | Latency         | Tested           | Refs |\n|:---------------:|:----------------:|:------------------:|:-----:|:---------------:|:---------------:|:----------------:|:----:|\n| Sandy Bridge    | :x:              | :x:                |     6 | 2 (ADD+MUL AVX) | 3 (ADD) 5 (MUL) |:heavy_check_mark:|  [1] |\n| Ivy Bridge      | :x:              | :x:                |     6 | 2 (ADD+MUL AVX) | 3 (ADD) 5 (MUL) |:heavy_check_mark:|  [2] |\n| Haswell         |:heavy_check_mark:| :x:                |    10 | 2 (FMA AVX2)    | 5 (FMA)         |:heavy_check_mark:|  [3] |\n| Broadwell       |:heavy_check_mark:| :x:                |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:|  [3] |\n| Skylake         |:heavy_check_mark:| :x:                |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:|  [3] |\n| Kaby Lake       |:heavy_check_mark:| :x:                |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:|  [4] |\n| Coffee Lake     |:heavy_check_mark:| :x:                |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:|  [5] |\n| Comet Lake      |:heavy_check_mark:| :x:                |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:| [10] |\n| Ice Lake        |:heavy_check_mark:| :heavy_check_mark: |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:| [12] |\n| Tiger Lake      |:heavy_check_mark:| :heavy_check_mark: |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:| [12] |\n| Rocket Lake     |:heavy_check_mark:| :heavy_check_mark: |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:|  [?] 
|\n| Alder Lake      |:heavy_check_mark:| :heavy_check_mark: |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:|  [?] |\n| Knights Landing |:heavy_check_mark:| :heavy_check_mark: |    12 | 2 (FMA AVX512)  | 6 (FMA)         |:heavy_check_mark:|  [6] |\n| Piledriver      |:heavy_check_mark:| :x:                |     5 | 1 (FMA AVX)     | 5 (FMA)         |:x:               |  [?] |\n| Zen             |:heavy_check_mark:| :x:                |     5 | 1 (FMA AVX2)    | 5 (FMA)         |:x:               |  [7] |\n| Zen+            |:heavy_check_mark:| :x:                |     5 | 1 (FMA AVX2)    | 5 (FMA)         |:heavy_check_mark:|  [8] |\n| Zen 2           |:heavy_check_mark:| :x:                |    10 | 2 (FMA AVX2)    | 5 (FMA)         |:x:               |  [9] |\n| Zen 3           |:heavy_check_mark:| :x:                |     8 | 2 (FMA AVX2)    | 4 (FMA)         |:heavy_check_mark:| [11] |\n\nReferences:\n- [1]  [Agner Fog Instruction Tables (Page 199, VADDPS)](https://www.agner.org/optimize/instruction_tables.pdf)\n- [2]  [Agner Fog Instruction Tables (Page 213, VADDPS)](https://www.agner.org/optimize/instruction_tables.pdf)\n- [3]  [Intel Intrinsics Guide (_mm256_fmadd_ps)](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=256_fmadd_ps\u0026expand=136,2553)\n- [4]  [Wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake#Pipeline)\n- [5]  [Agner Fog Instruction Tables (Page 299, VFMADD)](https://www.agner.org/optimize/instruction_tables.pdf)\n- [6]  [Intel Intrinsics Guide (_mm512_fmadd_ps)](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=512_fmadd_ps\u0026expand=136,2553,2557)\n- [7]  [Agner Fog Instruction Tables (Page 99, VFMADD)](https://www.agner.org/optimize/instruction_tables.pdf)\n- [8]  [Wikichip](https://en.wikichip.org/wiki/amd/microarchitectures/zen%2B#Pipeline)\n- [9]  [Agner Fog Instruction Tables (Page 111, VFMADD132PS)](https://www.agner.org/optimize/instruction_tables.pdf)\n- 
[10] [Wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/comet_lake)\n- [11] [Agner Fog Instruction Tables (Page 124, VFMADD132PS)](https://www.agner.org/optimize/instruction_tables.pdf)\n- [12] [Agner Fog Instruction Tables (Page 347, VADDPS)](https://www.agner.org/optimize/instruction_tables.pdf)\n\n## 6.2 GPU\n| uarch   | Latency  | Tested           | Refs |\n|:-------:|:--------:|:----------------:|:----:|\n| Maxwell |  6       |:heavy_check_mark:|  [] |\n| Pascal  |  6       |:heavy_check_mark:|  [] |\n| Turing  |  4       |:heavy_check_mark:|  [] |\n| Ampere  |  ?       |:x:               |  [] |\n\n_NOTES:_\n- The fact that a CPU belongs to a microarchitecture does not imply that it supports the vector extensions shown in this table (e.g., Pentium Skylake does not support AVX).\n- Older microarchitectures may be added in the future. I have not added older architectures because I cannot test peakperf on them, since I do not have access to that hardware.\n- Ice Lake and Tiger Lake support AVX512 instructions, but they only have 1 AVX512 VPU (at least in client versions), while they have 2 VPUs for AVX2. Because AVX512 runs at a lower frequency, the performance obtained with AVX2 (using 2 VPUs) is better than with AVX512 (using 1 VPU). Thus, peak performance is obtained using AVX2, even though these CPUs support the AVX512 instruction set.\n- The Slots column is calculated as `FPUs x Latency`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdr-noob%2Fpeakperf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdr-noob%2Fpeakperf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdr-noob%2Fpeakperf/lists"}