NUMA-aware multi-CPU multi-GPU data transfer benchmarks
- Host: GitHub
- URL: https://github.com/c3sr/comm_scope
- Owner: c3sr
- License: apache-2.0
- Created: 2018-07-10T11:41:44.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-10-26T20:57:27.000Z (about 1 year ago)
- Last Synced: 2024-05-03T06:58:30.644Z (8 months ago)
- Topics: bandwidth, benchmark-suite, cuda, gpu, hip, numa, nvlink, performance
- Language: C++
- Homepage: https://github.com/c3sr/scope
- Size: 665 KB
- Stars: 15
- Watchers: 7
- Forks: 3
- Open Issues: 14
- Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-numa - Comm|Scope - NUMA-aware multi-CPU multi-GPU CUDA data transfer benchmarks (Benchmarking)
README
# Comm|Scope
![Build Status](https://github.com/c3sr/comm_scope/actions/workflows/docker-image.yml/badge.svg)
## Prerequisites
* CMake 3.18+
* g++ >= 4.9
* CUDA toolkit >= 8.0 or ROCm >= 5.2.0

## Getting started
Recursive git clone:
```
git clone --recursive https://github.com/c3sr/comm_scope.git
```

Or, if you cloned without `--recursive`:
```
git submodule update --init --recursive
```

Build and list supported benchmarks:
```
mkdir build && cd build
cmake ..
make
./comm_scope --benchmark_list_tests=true
```

To choose specific benchmarks, filter by regex:
```
./comm_scope --benchmark_list_tests --benchmark_filter=<regex>
```

Once the desired benchmarks are selected, run them:
```
./comm_scope --benchmark_filter=<regex>
```
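
For example, to run only the pinned-to-GPU `cudaMemcpyAsync` benchmarks that appear in the sample output further below (use `--benchmark_list_tests` to see the exact names on your system):

```
./comm_scope --benchmark_filter='Comm_cudaMemcpyAsync_PinnedToGPU.*'
```
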
## Advanced

CSV output (results will still be printed to the console):
```
./comm_scope --benchmark_out=file.csv --benchmark_out_format=csv
```

To limit the visible GPUs, use the `--cuda` option:
```
./comm_scope --cuda 0 --cuda 1
```

To limit the visible NUMA nodes, use the `--numa` option:
```
./comm_scope --numa 8
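# Illustrative example, assuming --numa and --cuda can be combined
# in a single invocation to pin a specific NUMA node / GPU pairing:
# ./comm_scope --numa 0 --cuda 0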
```

Comm|Scope will attempt to control CPU clocks. Either run with elevated permissions, or you will see:
```
[2020-07-15 17:58:00.763] [scope] [error] unable to disable CPU turbo: no permission. Run with higher privileges?
[2020-07-15 17:58:00.763] [scope] [error] unable to set OS CPU governor to maximum: no permission. Run with higher privileges?
```

If you are willing to accept reduced accuracy, or are on a system where CPU clocks do not need to be controlled, you can ignore this error.
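
One way to supply the needed privileges is to run under `sudo` (a sketch; whether that is appropriate depends on your system and site policy):

```
sudo ./comm_scope --benchmark_filter=<regex>
```
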
You can control the log verbosity with the `SPDLOG_LEVEL` environment variable:
* `SPDLOG_LEVEL=trace`
* `SPDLOG_LEVEL=debug`
* `SPDLOG_LEVEL=info`
* `SPDLOG_LEVEL=warning`
* `SPDLOG_LEVEL=critical`
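
For example (any of the levels above can be substituted):

```
SPDLOG_LEVEL=debug ./comm_scope --benchmark_list_tests=true
```
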
## Warning: Inconsistent Console Reporting Suffixes

Google Benchmark formats the console output as shown below, with one inconsistency:
the `bytes` suffixes (`k`, `M`, `G`) are powers of 10 (`1e3`, `1e6`, `1e9`), while the `bytes_per_second` suffixes are powers of 2 (`2^10`, `2^20`, `2^30`).
For example, the raw values behind the last line below (`log2(N):12`) are `bytes=4096` and `bytes_per_second=1.33407e+09`, which are printed as `4.096k` (4096 / 1e3) and `1.24245G/s` (1.33407e+09 / 2^30).
Using the `csv` reporter (`--benchmark_out=file.csv --benchmark_out_format=csv`) writes the raw values to the file.
```
----------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------------------------------------------------
Comm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):8/manual_time 2804 ns 1065385791 ns 251315 bytes=256 bytes_per_second=87.0571M/s cuda_id=0 numa_id=0
Comm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):9/manual_time 2806 ns 1059562408 ns 250053 bytes=512 bytes_per_second=173.985M/s cuda_id=0 numa_id=0
Comm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):10/manual_time 2871 ns 1055014030 ns 246220 bytes=1024 bytes_per_second=340.196M/s cuda_id=0 numa_id=0
Comm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):11/manual_time 3033 ns 1070865035 ns 241507 bytes=2.048k bytes_per_second=643.883M/s cuda_id=0 numa_id=0
Comm_cudaMemcpyAsync_PinnedToGPU/0/0/log2(N):12/manual_time 3070 ns 984282144 ns 224948 bytes=4.096k bytes_per_second=1.24245G/s cuda_id=0 numa_id=0
```
## Recipes for Specific Systems
* [OLCF summit](summit.md)
* [Sandia Caraway](caraway.md)
* [Sandia Weaver](weaver.md)
* [OLCF crusher](crusher.md)
* [OLCF frontier](frontier.md)
## FAQ / Troubleshooting

**I get `CMake Error: Remove failed on file: : System Error: Device or resource busy`**

This sometimes happens on network file systems. You can retry, or do the build on a local disk.

**I get `-- The CXX compiler identification is GNU 4.8.5` after `module load gcc/5.4.0`.**

A different version of GCC may be in the CMake cache.
Try running `cmake -DCMAKE_CXX_COMPILER=g++ -DCMAKE_C_COMPILER=gcc`, or delete your build directory and restart.
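
A minimal sketch of the fresh-build option, assuming the intended compilers are first on `PATH` after `module load`:

```
rm -rf build && mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=g++ -DCMAKE_C_COMPILER=gcc ..
make
```
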
**I get `a PTX JIT compilation failed`**

Set `CUDAFLAGS` to the appropriate `-arch=sm_xx` for your system, e.g. `export CUDAFLAGS=-arch=sm_80` for ThetaGPU.
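
For example, from a fresh build directory (CMake typically only picks up the `CUDAFLAGS` environment variable on the first configure; `sm_80` is just the ThetaGPU example above):

```
export CUDAFLAGS=-arch=sm_80
rm -rf build && mkdir build && cd build
cmake ..
make
```
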
## Bumping the Version
Update the changelog and commit the changes.
Install bump2version:

```
pip install --user bump2version
```

Check that everything seems good (minor version, for example):

```
bump2version --dry-run minor --verbose
```

Actually bump the version:

```
bump2version minor
```

Push the changes:

```
git push && git push --tags
```
## Contributing
Any work on the underlying `cwpearson/libscope` library will probably benefit from changing the submodule remote from HTTPS to SSH:
```
cd thirdparty/libscope
git remote set-url origin [email protected]:cwpearson/libscope.git
```

## Contributors
* [Carl Pearson](mailto:[email protected])
* [Sarah Hashash](mailto:[email protected])

# Changelog
## v0.12.0 (Aug 9 2023)
* cwpearson/libscope 124999dc0017b437adcbebeaded52cf9d973ac28
* improve compiler compatibility
* improve CMake support
* add device synchronize benchmarks
* add libc memcpy benchmark
* add HIP benchmarks

## v0.11.2 (July 17 2020)
* cwpearson/libscope v1.1.2
* silence some warnings

## v0.11.1 (July 17 2020)
* cwpearson/libscope v1.1.1

## v0.11.0 (July 17 2020)
* cwpearson/libscope v1.1.0
* `cudaGraphInstantiate` and `cudaGraphLaunch`
* Reduce maximum `cudaMemcpyPeerAsync` size, since it is not truly async above ~2^27 bytes, which breaks the measurement strategy.

## v0.10.0 (June 23 2020)
* Rely on `cwpearson/libscope` instead of `c3sr/scope`
* `cwpearson/libscope` v1.0.0
* Remove dependence on sugar
* Add 3D strided memory transfer benchmarks
* Add CUDA runtime microbenchmarks
* Remove some duplicate NUMA-/non-NUMA-aware implementations of `cudaMemcpyAsync` benchmarks

## v0.9.0 (June 5 2020)
* Add CPU-GPU and GPU-GPU sparse data transfer benchmarks
* `cudaMemcpy3DAsync`
* `cudaMemcpy3DPeerAsync`
* `cudaMemcpy2DAsync`
* custom 3D kernel
* pack / `cudaMemcpyPeerAsync` / unpack

## v0.8.2 (March 6 2020)
* Fix an event-device mismatch in multi-GPU unidirectional `cudaMemcpyPeer` benchmarks
## v0.8.1 (March 5 2020)
* Disable peer access in non-peer `cudaMemcpyPeer` benchmarks
## v0.8.0 (March 5 2020)
* Add `cudaMemcpyPeer` uni/bidirectional benchmarks.
## v0.7.2 (April 8 2019)
* Add memory to the clobber list for x86 and ppc64le cache flushing.
## v0.7.1 (April 5 2019)
* Add v0.7.0 and v0.7.1 changelog
## v0.7.0 (April 5 2019)
* Make POWER's cache flushing code match the Linux kernel.
* rename "Coherence" benchmarks to "Demand"
* remove cudaStreamSynchronize from the timing path of zerocopy-duplex, demand-duplex, and prefetch-duplex
* Transition to better use of CMake's CUDA language support
* Use NVCC's compiler defines to check the CUDA version
* Disable Comm|Scope by default during Scope compilation

## v0.6.3 (Dec 20 2018)
* Add `USE_NUMA` CMake option
* Fix compile errors when `USE_NUMA=0` or NUMA cannot be found

## v0.6.2
* Fix checking non-existent cudaDeviceProp field in CUDA < 9
## v0.6.1
* Conform to updated SCOPE_REGSITER_AFTER_INIT
## v0.6.0
* Add unified memory allocation benchmarks
* Flush CPU caches in zero-copy benchmarks
* Add zerocopy duplex benchmarks
* Add unified memory prefetch duplex benchmark
* Add unified memory demand duplex benchmark
* Conform to updated SCOPE_REGSITER_AFTER_INIT

## v0.5.0
* Add zero-copy benchmarks
* Don't use nvToolsExt

## v0.4.0
* Add multithreaded Coherence GPU to Host benchmark
* Programmatically register most benchmarks based on system configuration
* use cudaMemcpyAsync in numa-memcpy
* Add travis and Dockerfiles
* Use `aligned_alloc` in numa-memcpy/pinned-to-gpu
* Add x86 and POWER cache control functions

## v0.3.0
* Rework documentation
* Use `target_include_scope_directories` and `target_link_scope_libraries`.
* Use Clara for flags.
* Remove numa/rd and numa/wr.

## v0.2.0
* Add `--numa_ids` command line flag.
* Use `--cuda_device_ids` and `--numa_ids` to select CUDA and NUMA devices for benchmarks.