{"id":27643373,"url":"https://github.com/rrze-hpc/thebandwidthbenchmark","last_synced_at":"2025-10-04T11:58:02.177Z","repository":{"id":147293096,"uuid":"174952325","full_name":"RRZE-HPC/TheBandwidthBenchmark","owner":"RRZE-HPC","description":"The ultimate bandwidth benchmark","archived":false,"fork":false,"pushed_at":"2025-08-05T03:36:09.000Z","size":176,"stargazers_count":51,"open_issues_count":1,"forks_count":15,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-08-05T05:25:02.073Z","etag":null,"topics":["benchmark","c","cache","memory","stream"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RRZE-HPC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-03-11T08:00:29.000Z","updated_at":"2025-08-01T13:39:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"1492bdc4-69ac-41aa-8923-ed1f27c1a23d","html_url":"https://github.com/RRZE-HPC/TheBandwidthBenchmark","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/RRZE-HPC/TheBandwidthBenchmark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RRZE-HPC%2FTheBandwidthBenchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RRZE-HPC%2FTheBandwidthBenchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RRZE-HPC%2FTheBandwidthBenchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RRZE-HPC%2FTheBandwidthBe
nchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RRZE-HPC","download_url":"https://codeload.github.com/RRZE-HPC/TheBandwidthBenchmark/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RRZE-HPC%2FTheBandwidthBenchmark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278308622,"owners_count":25965654,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","c","cache","memory","stream"],"created_at":"2025-04-24T00:13:24.336Z","updated_at":"2025-10-04T11:58:02.171Z","avatar_url":"https://github.com/RRZE-HPC.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# The Bandwidth Benchmark\n\nThis is a collection of simple streaming kernels.\n\nApart from the micro-benchmark functionality this is also a blueprint for other\nmicro-benchmark applications.\n\nIt contains C modules for:\n\n- Aligned data allocation\n- Query and control affinity settings\n- Accurate wall clock timing\n\nMoreover the benchmark showcases a simple generic Makefile that can be used in\nother projects.\n\nYou may want to have a look at\n\u003chttps://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki\u003e for a collection of\nresults that were created using 
TheBandwidthBenchmark.\n\n## Overview\n\nThe benchmark is heavily inspired by John McCalpin's\n\u003chttps://www.cs.virginia.edu/stream/\u003e benchmark.\n\nIt contains the following streaming kernels with corresponding data access\npattern (Notation: S - store, L - load, WA - write allocate). All variables are\nvectors, s is a scalar:\n\n- init (S1, WA): Initialize an array: `a = s`. Store only.\n- sum (L1): Vector reduction: `s += a`. Load only.\n- copy (L1, S1, WA): Classic memcopy: `a = b`.\n- update (L1, S1): Update vector: `a = a * scalar`. Also load + store but\n  without write allocate.\n- triad (L2, S1, WA): Stream triad: `a = b + c * scalar`.\n- daxpy (L2, S1): Daxpy: `a = a + b * scalar`.\n- striad (L3, S1, WA): Schoenauer triad: `a = b + c * d`.\n- sdaxpy (L3, S1): Schoenauer triad without write allocate: `a = a + b * c`.\n\n## Build\n\n1. Configure the tool chain and additional options in `config.mk`:\n\n```make\n# Supported: GCC, CLANG, ICC, ICX\nTOOLCHAIN ?= CLANG\nENABLE_OPENMP ?= false\nENABLE_LIKWID ?= false\n\nOPTIONS  =  -DSIZE=120000000ull\nOPTIONS +=  -DNTIMES=10\nOPTIONS +=  -DARRAY_ALIGNMENT=64\n#OPTIONS +=  -DVERBOSE_AFFINITY\n#OPTIONS +=  -DVERBOSE_DATASIZE\n#OPTIONS +=  -DVERBOSE_TIMER\n```\n\nThe verbosity options enable detailed output about affinity settings, allocation\nsizes and timer resolution. If you uncomment `-DVERBOSE_AFFINITY`, the processor\nevery thread is currently scheduled on and the complete affinity mask for every\nthread are printed.\n\n_Notice:_ OpenMP involves significant overhead through barrier cost, especially\non systems with many memory domains. The default problem size is set to almost\n4GB to have enough work vs overhead. If you suspect that the result should be\nbetter you may try to further increase the problem size. To compare to original\nstream results on X86 systems you have to ensure that streaming store\ninstructions are used. 
For the ICC tool chain this is now the default (Option\n`-qopt-streaming-stores=always`).\n\n- Build with:\n\n```sh\nmake\n```\n\nYou can build multiple tool chains in the same directory, but note that the\nMakefile only acts on the one currently set. Intermediate build results are\nlocated in the `./build/\u003cTOOLCHAIN\u003e` directory.\n\n- Clean up intermediate build results for active tool chain, data files and plots with:\n\n```sh\nmake clean\n```\n\nClean all build results for all tool chains:\n\n```sh\nmake distclean\n```\n\n- Optional targets:\n\nGenerate assembler:\n\n```sh\nmake asm\n```\n\nThe assembler files will also be located in the `./build/\u003cTOOLCHAIN\u003e` directory.\n\nReformat all source files using `clang-format`. This only works if\n`clang-format` is in your path.\n\n```sh\nmake format\n```\n\n## Support for clang language server\n\nThe Makefile will generate a `.clangd` configuration to correctly set all\noptions for the clang language server. This is only important if you use an\neditor with LSP support and want to edit or explore the source code.\nGenerating the `.clangd` configuration requires GNU Make 4.0 or newer. Older\nmake versions can still build the benchmark, but the `.clangd` configuration\nwill not be generated. The default Make version included in macOS is 3.81! Newer make\nversions can be easily installed on macOS using the\n[Homebrew](https://brew.sh/) package manager.\n\nAn alternative is to use [Bear](https://github.com/rizsotto/Bear), a tool that\ngenerates a compilation database for clang tooling. This method also lets you\njump to any definition without a previously opened buffer. 
You have to build\nTheBandwidthBenchmark one time with Bear as a wrapper:\n\n```sh\nbear -- make\n```\n\n## Usage\n\nTo run the benchmark call:\n\n```sh\n./bwBench-\u003cTOOLCHAIN\u003e [mode (optional)]\n```\n\nApart from the default parallel work sharing mode with fixed problem size\nTheBandwidthBenchmark also supports two modes with varying problem sizes:\nsequential (call with `seq` mode option) and throughput (call with `tp` mode\noption). These are intended for scanning the complete memory hierarchy instead\nof only the main memory domain. See below for details on how to use those modes.\n\n**NOTICE:** The `seq` and `tp` modes may take 30 minutes or more, depending on the\nsystem.\n\nIn default mode the benchmark will output the results similar to the stream\nbenchmark. Results are validated. For threaded execution it is recommended to\ncontrol thread affinity.\n\nWe recommend using `likwid-pin` for setting the number of threads used and to\ncontrol thread affinity:\n\n```sh\nlikwid-pin -C 0-3 ./bwbench-GCC\n```\n\nExample output for threaded execution:\n\n```txt\n-------------------------------------------------------------\n[pthread wrapper]\n[pthread wrapper] MAIN -\u003e 0\n[pthread wrapper] PIN_MASK: 0-\u003e1  1-\u003e2  2-\u003e3\n[pthread wrapper] SKIP MASK: 0x0\n        threadid 140271463495424 -\u003e core 1 - OK\n        threadid 140271455102720 -\u003e core 2 - OK\n        threadid 140271446710016 -\u003e core 3 - OK\nOpenMP enabled, running with 4 threads\n----------------------------------------------------------------------------\nFunction      Rate(MB/s)  Rate(MFlop/s)  Avg time     Min time     Max time\nInit:          22111.53    -             0.0148       0.0145       0.0165\nSum:           46808.59    46808.59      0.0077       0.0068       0.0140\nCopy:          30983.06    -             0.0207       0.0207       0.0208\nUpdate:        43778.69    21889.34      0.0147       0.0146       0.0148\nTriad:         34476.64    22984.43      
0.0282       0.0278       0.0305\nDaxpy:         45908.82    30605.88      0.0214       0.0209       0.0242\nSTriad:        37502.37    18751.18      0.0349       0.0341       0.0388\nSDaxpy:        46822.63    23411.32      0.0281       0.0273       0.0325\n----------------------------------------------------------------------------\nSolution Validates\n```\n\n## Scaling runs\n\nApart from the highest sustained memory bandwidth, the scaling behavior\nwithin memory domains is also an important system property.\n\nThere is a helper script downloadable at\n\u003chttps://github.com/RRZE-HPC/TheBandwidthBenchmark/wiki/util/extractResults.pl\u003e\nthat creates a text result file from multiple runs that can be used as input to\nplotting applications such as gnuplot and xmgrace. This involves two steps: Executing\nthe benchmark runs and creating the data file.\n\nTo run the benchmark for different thread counts within a memory domain execute\n(this assumes bash or zsh):\n\n```sh\nfor nt in 1 2 4 6 8 10; do likwid-pin -q -C E:M0:$nt:1:2 ./bwbench-ICC \u003e dat/emmy-$nt.txt; done\n```\n\nIt is recommended to just use one thread per core in case the processor supports\nhyperthreading. Use whatever stepping you like; here a stepping of two was used.\nThe `-q` option suppresses output from `likwid-pin`. The line above uses the\nexpression-based syntax. On systems with hyperthreading enabled (check with,\ne.g., `likwid-topology`) you have to skip the other hardware threads on each\ncore. For the system above with 2 hardware threads per core this results in `-C\nE:M0:$nt:1:2`, on a system with 4 hardware threads per core you would need `-C\nE:M0:$nt:1:4`. The string before the dash (here emmy) can be arbitrary, but\nthe extraction script expects the thread count after the dash. Also the file\nending has to be `.txt`. 
Please check a few result files with a text editor to verify that\neverything worked as expected.\n\nTo extract the results and output in a plot table format execute:\n\n```sh\n./extractResults.pl ./dat\n```\n\nThe script will pick up all result files in the directory specified and create a\ncolumn format output file. In this case:\n\n```txt\n#nt     Init    Sum     Copy    Update  Triad   Daxpy   STriad  SDaxpy\n1       4109    11900   5637    8025    7407    9874    8981    11288\n2       8057    22696   11011   15174   14821   18786   17599   21475\n4       15602   39327   21020   28197   27287   33633   31939   37146\n6       22592   45877   29618   37155   36664   40259   39911   41546\n8       28641   46878   35763   40111   40106   41293   41022   41950\n10      33151   46741   38187   40269   39960   40922   40567   41606\n```\n\nPlease be aware that the single core memory bandwidth as well as the scaling behavior\ndepend on the frequency settings.\n\n## Sequential vs Throughput mode: Sweeping over a range of problem sizes\n\nTheBandwidthBenchmark comes in 2 additional variants: Sequential and Throughput.\nThese 2 modes perform a sweep over different array sizes ranging from N = 100\nup to the array size N specified in `config.mk`.\n\n- **Sequential** - Runs TheBandwidthBenchmark in sequential mode for all\n  kernels. Command to run in sequential mode:\n\n```sh\n./bwBench-\u003cTOOLCHAIN\u003e seq\n```\n\n- **Throughput (Multi-threaded)** - Runs TheBandwidthBenchmark in multi-threaded\n  mode for all kernels. Requires flag **ENABLE_OPENMP=true** in `config.mk`.\n  Command to run in throughput mode:\n\n```sh\n./bwBench-\u003cTOOLCHAIN\u003e tp\n```\n\nEach of these modes outputs the results for each individual kernel. The output\nfiles will be created in the `./dat` directory.\n\n### Visualizing the data from the Sequential/Throughput modes\n\nRequired: Gnuplot 5.2+\n\nThe user can visualize the outputs from the `./dat` directory using the provided\ngnuplot scripts. 
The scripts are located in the `./gnuplot_script` directory, where a\nbash file takes care of generating and executing the gnuplot commands. The plots\nfrom gnuplot can then be found in the `./plot` directory.\n\nThere are 2 ways you can visualize the output:\n\n- **Plotting Array Size (N) vs Bandwidth (MB/s)** - this mode creates a plot with\n  the Array Size (N) on the x-axis and Bandwidth (MB/s) on the y-axis. The Array size (N)\n  will be the same for each kernel. Use this makefile command to generate this\n  type of plot:\n\n```sh\nmake plot\n```\n\n- **Plotting Dataset Size (MB) vs Bandwidth (MB/s)** - this mode creates a plot\n  with the Dataset Size (MB) on the x-axis and Bandwidth (MB/s) on the y-axis. The Dataset\n  size (MB) will be different for each kernel. For example the total dataset\n  size for the Init kernel will be 4 times smaller than the total dataset size for the STriad\n  kernel.\n\n```sh\nmake plot_dataset\n```\n\nThe script also generates a combined plot with bandwidths from all the kernels\nin one plot.\n\n## Caveats\n\nA few known caveats for the user, based on experience with the compilers.\n\n- Intel oneAPI DPC++/C++ Compiler 2023.2.0 (icx/icpx compiler):\n  - NonTemporal Stores (aka Streaming Stores): We leave the choice to the user whether to use NT stores or not.\n    - If the user wants to use NT stores using the `-qopt-streaming-stores=always` compiler flag, then the user has to avoid using the `-ffreestanding` compiler flag. This will not generate NT instructions, but generates calls to `__libirc_nontemporal_store@PLT` in the assembly. 
\n    - For the Throughput mode with OpenMP, the icx/icpx compiler does not respect the `nontemporal()` clause with the OpenMP `simd` directive.\n\n    It's recommended not to use NT stores if the user wants to observe the cache hierarchy when using the Sequential or Throughput mode.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frrze-hpc%2Fthebandwidthbenchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frrze-hpc%2Fthebandwidthbenchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frrze-hpc%2Fthebandwidthbenchmark/lists"}