https://github.com/hunsa/reprompi
ReproMPI Benchmark for MPI Collective
- Host: GitHub
- URL: https://github.com/hunsa/reprompi
- Owner: hunsa
- License: gpl-2.0
- Created: 2023-09-27T20:09:25.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-04-18T06:51:21.000Z (about 1 year ago)
- Last Synced: 2025-04-18T08:36:47.661Z (about 1 year ago)
- Topics: benchmarking, clock-synchronization, collectives, mpi
- Language: C
- Homepage:
- Size: 1.63 MB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: ChangeLog
- License: LICENSE
- Authors: AUTHORS
# ReproMPI Benchmark (Development Version)
## Introduction
The ReproMPI Benchmark is a tool designed to accurately measure the
run-time of MPI blocking collective operations. It provides multiple
process synchronization methods and a flexible mechanism for
predicting the number of measurements that are sufficient to obtain
statistically sound results.
## References
1. Sascha Hunold, Alexandra Carpen-Amarie:
On the Impact of Synchronizing Clocks and Processes on Benchmarking MPI Collectives. EuroMPI 2015: 8:1-8:10
2. Sascha Hunold, Alexandra Carpen-Amarie, Jesper Larsson Träff:
Reproducible MPI Micro-Benchmarking Isn't As Easy As You Think. EuroMPI/ASIA 2014: 69
3. Sascha Hunold, Alexandra Carpen-Amarie:
Reproducible MPI Benchmarking is Still Not as Easy as You Think. IEEE Trans. Parallel Distributed Syst. 27(12): 3617-3630 (2016)
4. Sascha Hunold, Alexandra Carpen-Amarie:
Hierarchical Clock Synchronization in MPI. CLUSTER 2018: 325-336
5. Sascha Hunold, Alexandra Carpen-Amarie:
Autotuning MPI Collectives using Performance Guidelines. HPC Asia 2018: 64-74
6. Joseph Schuchart, Sascha Hunold, George Bosilca:
Synchronizing MPI Processes in Space and Time. EuroMPI 2023: 7:1-7:11
## Components
- `mpibenchmark`: actual MPI benchmark for collectives
- [`pgchecker`](https://github.com/hunsa/reprompi/tree/main/src/pgcheck/): performance guideline checker
## Installation
- Prerequisites
  - an MPI library
  - CMake (version >= 3.0)
  - GSL libraries
## Basic installation
```
cd $BENCHMARK_PATH
cmake .
make
```
For specific configuration options check the *Benchmark Configuration* section.
## Running the ReproMPI Benchmark
The ReproMPI code is designed to serve two specific purposes:
## Benchmarking of MPI collective calls
The most common usage scenario of the benchmark is to specify an MPI
collective function to be benchmarked, a (list of) message sizes and
the *number of measurement repetitions* for each test, as in the
following example.
```
mpirun -np 4 ./bin/mpibenchmark --calls-list=MPI_Bcast,MPI_Allgather \
    --msizes-list=8,1024,2048 --nrep=10
```
## Command-line Options
### Common Options
- `-h` print help
- `-v` print run-times measured for each process
- `--msizes-list=` list of comma-separated message sizes in
Bytes, e.g., `--msizes-list=10,1024`
- `--msize-interval=min=,max=,step=` list of power
of 2 message sizes as an interval between $2^{min}$ and $2^{max}$,
with $2^{step}$ distance between values, e.g.,
`--msize-interval=min=1,max=4,step=1`
- `--calls-list=` list of comma-separated MPI calls to be
benchmarked, e.g., `--calls-list=MPI_Bcast,MPI_Allgather`
- `--root-proc=` root node for collective operations
- `--operation=` MPI operation applied by collective
operations (where applicable), e.g., `--operation=MPI_BOR`.
Supported operations: MPI_BOR, MPI_BAND, MPI_LOR, MPI_LAND,
MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD
- `--datatype=` MPI datatype used by collective
operations, e.g., `--datatype=MPI_CHAR`.
Supported datatypes: `MPI_CHAR`, `MPI_INT`, `MPI_FLOAT`, `MPI_DOUBLE`
- `--shuffle-jobs` shuffle experiments before running the benchmark
- `--params=k1:v1,k2:v2` list of comma-separated `key:value` pairs
to be printed in the benchmark output.
- `-f | --input-file=` input file containing the list of
benchmarking jobs (tuples of MPI function, message size, number of
repetitions). It replaces all the other common options.
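
As an illustration of how `--msize-interval` expands into message sizes, here is a minimal sketch in plain C. It is not the benchmark's parser. It assumes that the exponent grows by `step` from `min` to `max`, so `min=1,max=4,step=1` yields 2, 4, 8 and 16 Bytes. The function name `expand_msize_interval` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Expand an interval of power-of-two message sizes (in Bytes).
 * ASSUMPTION: the sizes are 2^min, 2^(min+step), ..., never exceeding 2^max.
 * At most `cap` sizes are written to `out`; the return value is the number
 * of sizes contained in the interval. */
size_t expand_msize_interval(unsigned min_exp, unsigned max_exp, unsigned step,
                             size_t *out, size_t cap)
{
    assert(step > 0 && min_exp <= max_exp && max_exp < 8 * sizeof(size_t));

    size_t count = 0;
    unsigned e = min_exp;
    for (;;) {
        if (count < cap)
            out[count] = (size_t)1 << e;
        count++;
        if (max_exp - e < step)   /* the next exponent would exceed max */
            break;
        e += step;
    }
    return count;
}
```

Calling it with a buffer of eight entries and `min=0,max=10,step=5` would fill in the sizes 1, 32 and 1024.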
### Options Related to the Window-based Synchronization
- `--window-size=` window size in microseconds for Window-based synchronization
### Specific Options for the ReproMPI Benchmark
- `--nrep=` set number of experiment repetitions
- `--summary=` list of comma-separated data summarizing
methods (mean, median, min, max, var, stddev), e.g., `--summary=mean,max`
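
To make the summarizing methods concrete, the following sketch computes most of the values listed for `--summary` from a set of measured run-times. It is an illustration only: the type `summary_t` and the function `summarize` are hypothetical names, and the use of the sample variance (division by n-1) is an assumption rather than something this README specifies. The standard deviation is the square root of `var` and is left out to keep the sketch free of math-library dependencies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    double mean, median, min, max, var;
} summary_t;

/* Three-way comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarize n run-times. Note that `t` is sorted in place.
 * ASSUMPTION: `var` is the sample variance (n-1 in the denominator). */
summary_t summarize(double *t, size_t n)
{
    assert(n > 0);
    summary_t s;

    qsort(t, n, sizeof *t, cmp_double);
    s.min = t[0];
    s.max = t[n - 1];
    s.median = (n % 2) ? t[n / 2] : 0.5 * (t[n / 2 - 1] + t[n / 2]);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += t[i];
    s.mean = sum / n;

    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = t[i] - s.mean;
        ss += d * d;
    }
    s.var = (n > 1) ? ss / (n - 1) : 0.0;
    return s;
}
```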
## Supported Collective Operations:
### MPI Collectives
- `MPI_Allgather`
- `MPI_Allreduce`
- `MPI_Alltoall`
- `MPI_Barrier`
- `MPI_Bcast`
- `MPI_Exscan`
- `MPI_Gather`
- `MPI_Reduce`
- `MPI_Reduce_scatter`
- `MPI_Reduce_scatter_block`
- `MPI_Scan`
- `MPI_Scatter`
### Mockup Functions of Various MPI Collectives
| **MPI_Allgather** | **MPI_Allreduce** | **MPI_Alltoall** | **MPI_Bcast** | **MPI_Gather** | **MPI_Reduce** | **MPI_Reduce_scatter_block** | **MPI_Scan** | **MPI_Scatter** |
|-------------------|------------------------------|------------------|-------------------|----------------|---------------------------|------------------------------|--------------------|-----------------|
| Default | Default | Default | Default | Default | Default | Default | Default | Default |
| Allgatherv | Reduce+Bcast | Alltoallv | Allgatherv | Allgather | Allreduce | Reduce+Scatter | Exscan+Reducelocal | Bcast |
| Allreduce | Reducescatterblock+Allgather | Lane | Scatter+Allgather | Gatherv | Reducescatterblock+Gather | Reducescatter | Lane | Scatterv |
| Alltoall | Reducescatter+Allgatherv | | Lane | Reduce | Reducescatter+Gatherv | Allreduce | Hier | Lane |
| Gather+Bcast | Lane | | Hier | Lane | Reducescatter | Hier | | Hier |
| Lane | Hier | | | Hier | Lane | Lane | | |
| Lane Zero | | | | | Hier | | | |
| Hier              |                              |                  |                   |                |                           |                              |                    |                 |
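
To illustrate what a mock-up function does, the sketch below emulates the `Exscan+Reducelocal` variant of `MPI_Scan` in plain C, with array index `r` standing in for MPI rank `r` and `MPI_SUM` as the operation. It makes no MPI calls and is not the benchmark's code; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated MPI_Exscan with MPI_SUM: rank r obtains the sum of the
 * contributions of ranks 0..r-1. In MPI the result on rank 0 is undefined;
 * this simulation sets it to 0 so that the next step is uniform. */
void sim_exscan_sum(const int *in, int *out, size_t nranks)
{
    assert(in != out);   /* the buffers must not alias */
    int acc = 0;
    for (size_t r = 0; r < nranks; r++) {
        out[r] = acc;
        acc += in[r];
    }
}

/* Mock-up of an inclusive scan: an exclusive scan followed by a local
 * reduction of each rank's own contribution (the MPI_Reduce_local step). */
void mockup_scan_sum(const int *in, int *out, size_t nranks)
{
    sim_exscan_sum(in, out, nranks);
    for (size_t r = 0; r < nranks; r++)
        out[r] += in[r];
}
```

With real MPI calls, rank 0 would have to copy its own input instead of combining it with the undefined `MPI_Exscan` result.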
## Process Synchronization Methods
### MPI_Barrier
This is the default synchronization method enabled for the benchmark.
### Dissemination Barrier
To benchmark collective operations across multiple MPI libraries using
the same barrier implementation, the benchmark provides a
dissemination barrier that can replace the default MPI_Barrier to
synchronize processes.
To enable the dissemination barrier, the following flag has to be set
before compiling the benchmark (e.g., using the `ccmake` command).
```
ENABLE_BENCHMARK_BARRIER
```
Both barrier-based synchronization methods can alternatively use a
double barrier before each measurement.
```
ENABLE_DOUBLE_BARRIER
```
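
For readers unfamiliar with the algorithm, the communication pattern of a textbook dissemination barrier can be sketched as follows: in round k, each process signals the process 2^k ranks ahead of it and waits for the process 2^k ranks behind it, for ceil(log2(p)) rounds. The sketch only computes the partners; the function names are hypothetical and the benchmark's own implementation may differ in its details.

```c
#include <assert.h>

/* Number of rounds: the smallest k such that 2^k >= nprocs. */
int dissemination_rounds(int nprocs)
{
    assert(nprocs > 0);
    int rounds = 0;
    for (long long dist = 1; dist < nprocs; dist *= 2)
        rounds++;
    return rounds;
}

/* Communication partners of `rank` in round `round` (0-based):
 * it sends to the rank 2^round ahead and receives from the rank
 * 2^round behind, both modulo the number of processes. */
void dissemination_partners(int rank, int nprocs, int round,
                            int *send_to, int *recv_from)
{
    assert(rank >= 0 && rank < nprocs);
    assert(round >= 0 && round < dissemination_rounds(nprocs));

    int dist = 1 << round;   /* dist < nprocs holds for every valid round */
    *send_to = (rank + dist) % nprocs;
    *recv_from = (rank - dist + nprocs) % nprocs;
}
```

After the last round every process has, directly or transitively, heard from every other process, which is why no process can leave the barrier early.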
### Window-based Synchronization
The ReproMPI benchmark implements a window-based process
synchronization mechanism, which estimates the clock offset/drift of
each process relative to a reference process and then uses the
obtained global clocks to synchronize processes before each
measurement and to compute run-times.
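
As a rough illustration of the idea, the sketch below fits a linear model (drift and initial offset) to measured clock offsets and uses it to translate a local timestamp into the reference clock's time base. It is a simplified sketch under stated assumptions, not the benchmark's algorithm: the offset is defined as local time minus reference time, the offsets are assumed to be already measured, and the names `clock_model_t`, `fit_clock_model` and `to_global_time` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Linear clock model: offset(t) ~= slope * t + intercept, where t is the
 * local time and offset = local time - reference time (ASSUMED convention). */
typedef struct {
    double slope;       /* clock drift */
    double intercept;   /* clock offset at local time 0 */
} clock_model_t;

/* Least-squares fit of the model to n (local_time, offset) samples. */
clock_model_t fit_clock_model(const double *local_time, const double *offset,
                              size_t n)
{
    assert(n >= 2);
    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; i++) {
        mx += local_time[i];
        my += offset[i];
    }
    mx /= n;
    my /= n;

    double sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = local_time[i] - mx;
        sxx += dx * dx;
        sxy += dx * (offset[i] - my);
    }
    assert(sxx > 0.0);   /* the samples must span more than one time point */

    clock_model_t m;
    m.slope = sxy / sxx;
    m.intercept = my - m.slope * mx;
    return m;
}

/* Translate a local timestamp into the reference ("global") time base. */
double to_global_time(clock_model_t m, double local_time)
{
    return local_time - (m.slope * local_time + m.intercept);
}
```

With such a model on each process, all processes can agree on a common start time for a measurement window and report timestamps in the same time base.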
### Timing procedure
The MPI operation run-time is computed in a different manner
depending on the selected clock synchronization method. If global
clocks are available, the run-times are computed as the difference
between the largest exit time and the first start time among all
processes.
If a barrier-based synchronization is used, the run-time of an MPI
call is computed as the largest local run-time across all processes.
However, the timing procedure that relies on global clocks can also be
used in combination with a barrier-based synchronization when the
`ENABLE_GLOBAL_TIMES` flag is enabled.
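
The two ways of computing the run-time described in this section can be sketched in plain C as follows. The arrays hold one start and one end timestamp per process for a single measurement; the function names are hypothetical and the snippet is an illustration, not the benchmark's code.

```c
#include <assert.h>
#include <stddef.h>

/* Global clocks: largest exit time minus first start time over all processes.
 * The timestamps must share one time base for this to be meaningful. */
double runtime_global(const double *start, const double *end, size_t nprocs)
{
    assert(nprocs > 0);
    double first_start = start[0], last_end = end[0];
    for (size_t p = 1; p < nprocs; p++) {
        if (start[p] < first_start) first_start = start[p];
        if (end[p] > last_end)      last_end = end[p];
    }
    return last_end - first_start;
}

/* Barrier-based synchronization: largest local run-time over all processes.
 * Each difference uses only one process's own clock. */
double runtime_local_max(const double *start, const double *end, size_t nprocs)
{
    assert(nprocs > 0);
    double max_local = end[0] - start[0];
    for (size_t p = 1; p < nprocs; p++) {
        double local = end[p] - start[p];
        if (local > max_local) max_local = local;
    }
    return max_local;
}
```

For start times {0, 1, 2} and end times {5, 4, 8}, the first rule gives 8 and the second gives 6, which shows how skewed start times inflate the run-time measured with global clocks.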
### Clock resolution
The `MPI_Wtime` call is used by default to obtain the current time.
To obtain accurate measurements of short time intervals, the benchmark
can rely on the high-resolution `RDTSC`/`RDTSCP` instructions (if they are
available on the test machines) by setting one of the following flags:
```
ENABLE_RDTSC
ENABLE_RDTSCP
```
Additionally, setting the clock frequency of the CPU is required to
obtain accurate measurements:
```
FREQUENCY_MHZ 2300
```
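
The frequency is needed because the time-stamp counter reports ticks rather than seconds. A minimal sketch of the conversion, assuming a counter that advances at the constant rate given by `FREQUENCY_MHZ` (the function name `ticks_to_seconds` is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Convert an interval between two time-stamp counter readings to seconds.
 * ASSUMPTION: the counter ticks at a constant `frequency_mhz` MHz. */
double ticks_to_seconds(uint64_t start_ticks, uint64_t end_ticks,
                        double frequency_mhz)
{
    assert(frequency_mhz > 0.0 && end_ticks >= start_ticks);
    return (double)(end_ticks - start_ticks) / (frequency_mhz * 1e6);
}
```

At 2300 MHz, 2300 ticks correspond to one microsecond, so a wrong frequency value scales every reported run-time by the same factor.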
The clock frequency can also be automatically estimated (as done by
the NetGauge tool) by enabling the following variable:
```
CALIBRATE_RDTSC
```
However, this method reduces the accuracy of the results, so we advise
setting the highest CPU frequency manually instead. More details about
the usage of `RDTSC`-based timers can be found in our research
report.
## List of Compilation Flags
This is the full list of compilation flags that can be used to control
all the previously detailed configuration parameters.
```
CALIBRATE_RDTSC OFF
COMPILE_BENCH_TESTS OFF
COMPILE_SANITY_CHECK_TESTS OFF
ENABLE_BENCHMARK_BARRIER OFF
ENABLE_DOUBLE_BARRIER OFF
ENABLE_GLOBAL_TIMES OFF
ENABLE_LOGP_SYNC OFF
ENABLE_RDTSC OFF
ENABLE_RDTSCP OFF
ENABLE_WINDOWSYNC_HCA OFF
ENABLE_WINDOWSYNC_JK OFF
ENABLE_WINDOWSYNC_SK OFF
FREQUENCY_MHZ 2300
```
## Clock Synchronization Algorithms
### HCA [1]
### HCA2 [1]
### HCA3 [4]
### Topo1 [4]
### Topo2 [4]
- two-level hierarchical clock-sync
  - top level for sync between nodes
  - bottom level on compute node
- default
  - top: HCA3
  - bottom: ClockPropagation