{"id":13785760,"url":"https://github.com/c3sr/comm_scope","last_synced_at":"2026-01-17T01:37:48.540Z","repository":{"id":46023195,"uuid":"140423406","full_name":"c3sr/comm_scope","owner":"c3sr","description":"NUMA-aware multi-CPU multi-GPU data transfer benchmarks","archived":false,"fork":false,"pushed_at":"2023-10-26T20:57:27.000Z","size":681,"stargazers_count":21,"open_issues_count":14,"forks_count":3,"subscribers_count":7,"default_branch":"master","last_synced_at":"2024-11-17T22:35:48.461Z","etag":null,"topics":["bandwidth","benchmark-suite","cuda","gpu","hip","numa","nvlink","performance"],"latest_commit_sha":null,"homepage":"https://github.com/c3sr/scope","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/c3sr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2018-07-10T11:41:44.000Z","updated_at":"2024-09-25T06:29:51.000Z","dependencies_parsed_at":"2023-01-30T00:45:51.175Z","dependency_job_id":"e113b334-46ee-4bd3-b94c-ea5838c72e7e","html_url":"https://github.com/c3sr/comm_scope","commit_stats":null,"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c3sr%2Fcomm_scope","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c3sr%2Fcomm_scope/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c3sr%2Fcomm_scope/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c3sr%2Fcomm_scope/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/c3sr","d
ownload_url":"https://codeload.github.com/c3sr/comm_scope/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253638937,"owners_count":21940434,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bandwidth","benchmark-suite","cuda","gpu","hip","numa","nvlink","performance"],"created_at":"2024-08-03T19:01:04.173Z","updated_at":"2026-01-17T01:37:48.529Z","avatar_url":"https://github.com/c3sr.png","language":"C++","funding_links":[],"categories":["Benchmarking"],"sub_categories":[],"readme":"# Comm|Scope\n\n![Build Status](https://github.com/c3sr/comm_scope/actions/workflows/docker-image.yml/badge.svg)\n\n## Prerequisites\n\n* CMake 3.18+\n* g++ \u003e= 4.9\n* CUDA toolkit \u003e= 8.0 or ROCm \u003e= 5.2.0\n\n## Getting started\n\nRecursive git clone:\n```\ngit clone --recursive https://github.com/c3sr/comm_scope.git\n```\n\nOr, if you cloned without `--recursive`:\n```\n\u003cdownload or clone Comm|Scope\u003e\ngit submodule update --init --recursive\n```\n\nBuild and list supported benchmarks:\n```\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake\n./comm_scope --benchmark_list_tests=true\n```\n\nTo choose specific benchmarks, filter by regex:\n\n```\n./comm_scope --benchmark_list_tests --benchmark_filter=\u003cregex\u003e\n```\n\nOnce the desired benchmarks are selected, run them:\n\n```\n./comm_scope --benchmark_filter=\u003cregex\u003e\n```\n\n## Advanced\n\nCSV Output (will still print on console):\n```\n./comm_scope --benchmark_out=file.csv --benchmark_out_format=csv\n```\n\nTo limit the visible GPUs, use the 
`--cuda` option:\n\n```\n./comm_scope --cuda 0 --cuda 1\n```\n\nTo limit the visible NUMA nodes, use the `--numa` option:\n\n```\n./comm_scope --numa 8\n```\n\nComm|Scope will attempt to control CPU clocks. Either run with elevated permissions, or you will see:\n```\n[2020-07-15 17:58:00.763] [scope] [error] unable to disable CPU turbo: no permission. Run with higher privileges?\n[2020-07-15 17:58:00.763] [scope] [error] unable to set OS CPU governor to maximum: no permission. Run with higher privileges?\n```\n\nIf you are willing to accept reduced accuracy, or are on a system where CPU clocks do not need to be controlled, you can ignore this error.\n\nYou can control the log verbosity with the following environment variables:\n* `SPDLOG_LEVEL=trace`\n* `SPDLOG_LEVEL=debug`\n* `SPDLOG_LEVEL=info`\n* `SPDLOG_LEVEL=warning`\n* `SPDLOG_LEVEL=critical`\n\n## Warning: Inconsistent Console Reporting Suffixes\n\nGoogle Benchmark will format the console output in the following way, with an inconsistency.\nThe `bytes` suffixes (`k`, `M`, `G`) are powers of 10 (`1e3`, `1e6`, `1e9`), while the `bytes_per_second` suffixes are powers of 2 (`2^10`, `2^20`, `2^30`).\nFor example, the raw values for line 12 are `bytes=4096` and `bytes_per_second=1.33407e+09`.\nUsing the `csv` reporter prints the raw values to the file: `--benchmark_out=file.csv` and `--benchmark_out_format=csv`.\n```\n----------------------------------------------------------------------------------------------------------------------------\nBenchmark                                                                  Time             CPU   Iterations UserCounters...\n----------------------------------------------------------------------------------------------------------------------------\nComm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):8/manual_time              2804 ns   1065385791 ns       251315 bytes=256 bytes_per_second=87.0571M/s cuda_id=0 numa_id=0\nComm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):9/manual_time    
          2806 ns   1059562408 ns       250053 bytes=512 bytes_per_second=173.985M/s cuda_id=0 numa_id=0\nComm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):10/manual_time             2871 ns   1055014030 ns       246220 bytes=1024 bytes_per_second=340.196M/s cuda_id=0 numa_id=0\nComm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):11/manual_time             3033 ns   1070865035 ns       241507 bytes=2.048k bytes_per_second=643.883M/s cuda_id=0 numa_id=0\nComm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):12/manual_time             3070 ns    984282144 ns       224948 bytes=4.096k bytes_per_second=1.24245G/s cuda_id=0 numa_id=0\n\n```\n\n## Recipes for Specific Systems\n\n* [OLCF summit](summit.md)\n* [Sandia Caraway](caraway.md)\n* [Sandia Weaver](weaver.md)\n* [OLCF crusher](crusher.md)\n* [OLCF frontier](frontier.md)\n\n## FAQ / Troubleshooting\n\n**I get `CMake Error: Remove failed on file: \u003cblah\u003e: System Error: Device or resource busy`**\n\nThis sometimes happens on network file systems. You can retry, or do the build on a local disk.\n\n**I get `-- The CXX compiler identification is GNU 4.8.5` after `module load gcc/5.4.0`**\n\nA different version of GCC may be in the CMake cache.\nTry running `cmake -DCMAKE_CXX_COMPILER=g++ -DCMAKE_C_COMPILER=gcc`, or deleting your build directory and restarting.\n\n**I get `a PTX JIT compilation failed`**\n\nSet `CUDAFLAGS` to the appropriate `-arch=sm_xx` for your system, e.g. 
`export CUDAFLAGS=-arch=sm_80` for ThetaGPU.\n\n## Bumping the Version\n\nUpdate the changelog and commit the changes.\n\nInstall bump2version\n\n```pip install --user bump2version```\n\nCheck that everything seems good (minor version, for example)\n\n```bump2version --dry-run minor --verbose```\n\nActually bump the version\n\n```bump2version minor```\n\nPush the changes\n\n```git push \u0026\u0026 git push --tags```\n\n## Contributing\n\nAny work on the underlying `cwpearson/libscope` library will probably benefit from changing the submodule from http to SSH:\n\n```\ncd thirdparty/libscope\ngit remote set-url origin git@github.com:cwpearson/libscope.git\n```\n\n## Contributors\n\n* [Carl Pearson](mailto:cwpears@sandia.gov)\n* [Sarah Hashash](mailto:hashash2@illinois.edu)\n\n# Changelog\n\n## v0.12.0 (Aug 9 2023)\n* cwpearson/libscope 124999dc0017b437adcbebeaded52cf9d973ac28\n* improve compiler compatibility\n* improve CMake support\n* add device synchronize benchmarks\n* add libc memcpy benchmark\n* add HIP benchmarks\n\n## v0.11.2 (July 17 2020)\n* cwpearson/libscope v1.1.2\n* silence some warnings\n\n## v0.11.1 (July 17 2020)\n* cwpearson/libscope v1.1.1\n\n## v0.11.0 (July 17 2020)\n* cwpearson/libscope v1.1.0\n* `cudaGraphInstantiate` and `cudaGraphLaunch`\n* Reduce maximum `cudaMemcpyPeerAsync` size, since it is not truly async above ~2^27 which breaks the measurement strategy.\n\n## v0.10.0 (June 23 2020)\n* Rely on `cwpearson/libscope` instead of `c3sr/scope`\n* `cwpearson/libscope` v1.0.0\n* Remove dependence on sugar\n* Add 3D strided memory transfer benchmarks\n* Add CUDA runtime microbenchmarks\n* Remove some duplicate NUMA-/non-NUMA-aware implementations of cudaMemcpyAsync benchmarks\n\n## v0.9.0 (June 5 2020)\n\n* Add CPU-GPU and GPU-GPU sparse data transfer benchmarks\n  * `cudaMemcpy3DAsync`\n  * `cudaMemcpy3DPeerAsync`\n  * `cudaMemcpy2DAsync`\n  * custom 3D kernel\n  * pack / `cudaMemcpyPeerAsync` / unpack\n\n## v0.8.2 (March 6 2020)\n\n* Fix a 
event-device mismatch in multi-GPU unidirectional `cudaMemcpyPeer` benchmarks\n\n## v0.8.1 (March 5 2020)\n\n* Disable peer access in non-peer `cudaMemcpyPeer` benchmarks\n\n## v0.8.0 (March 5 2020)\n\n* Add `cudaMemcpyPeer` uni/bidirectional benchmarks.\n\n## v0.7.2 (April 8 2019)\n\n* Add memory to the clobber list for x86 and ppc64le cache flushing.\n\n## v0.7.1 (April 5 2019)\n\n* Add v0.7.0 and v0.7.1 changelog\n\n## v0.7.0 (April 5 2019)\n\n* Make POWER's cache flushing code match the Linux kernel.\n* Rename \"Coherence\" benchmarks to \"Demand\"\n* Remove cudaStreamSynchronize from the timing path of zerocopy-duplex, demand-duplex, and prefetch-duplex\n* Transition to better use of CMake's CUDA language support\n* Use NVCC's compiler defines to check the CUDA version\n* Disable Comm|Scope by default during Scope compilation\n\n## v0.6.3 (Dec 20 2018)\n\n* Add `USE_NUMA` CMake option\n* Fix compile errors when USE_NUMA=0 or NUMA cannot be found\n\n## v0.6.2\n\n* Fix checking non-existent cudaDeviceProp field in CUDA \u003c 9\n\n## v0.6.1\n\n* Conform to updated SCOPE_REGSITER_AFTER_INIT\n\n## v0.6.0\n\n* Add unified memory allocation benchmarks\n* Flush CPU caches in zero-copy benchmarks\n* Add zerocopy duplex benchmarks\n* Add unified memory prefetch duplex benchmark\n* Add unified memory demand duplex benchmark\n* Conform to updated SCOPE_REGSITER_AFTER_INIT\n\n## v0.5.0\n\n* Add zero-copy benchmarks\n* Don't use nvToolsExt\n\n## v0.4.0\n\n* Add multithreaded Coherence GPU to Host benchmark\n* Programmatically register most benchmarks based on system configuration\n* Use cudaMemcpyAsync in numa-memcpy\n* Add Travis and Dockerfiles\n* Use `aligned_alloc` in numa-memcpy/pinned-to-gpu\n* Add x86 and POWER cache control functions\n\n## v0.3.0\n\n* Rework documentation\n* Use `target_include_scope_directories` and `target_link_scope_libraries`.\n* Use Clara for flags.\n* Remove numa/rd and numa/wr.\n\n## v0.2.0\n\n* Add `--numa_ids` command line flag.\n* Use 
`--cuda_device_ids` and `--numa_ids` to select CUDA and NUMA devices for benchmarks.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fc3sr%2Fcomm_scope","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fc3sr%2Fcomm_scope","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fc3sr%2Fcomm_scope/lists"}