The ReproMPI Benchmark is a tool designed to accurately measure the run-time of MPI blocking collective operations. It provides multiple process synchronization methods and a flexible mechanism for predicting the number of measurements that are sufficient to obtain statistically sound results.
- an MPI library
- CMake (version >= 2.6)
- GSL libraries
cd $BENCHMARK_PATH ./cmake . make
For specific configuration options check the Benchmark Configuration section.
The ReproMPI code is designed to serve two specific purposes:
The most common usage scenario of the benchmark is to specify an MPI collective function to be benchmarked, a (list of) message sizes and the number of measurement repetitions for each test, as in the following example.
mpirun -np 4 ./bin/mpibenchmark --calls-list=MPI_Bcast,MPI_Allgather --msizes-list=8,1024,2048 --nrep=10
In this scenario, the user can generate an estimation of the number of measurements required for a stable result with respect to one or multiple prediction methods.
More details about the various methods that are supported and their usage can be found in:
- S. Hunold, A. Carpen-Amarie, F.D. Lübbe and J.L. Träff, “Automatic Verification of Self-Consistent MPI Performance Guidelines”, EuroPar (2016)
This is an example of how to perform such an estimation:
mpirun -np 4 ./bin/mpibenchmarkPredNreps --calls-list=MPI_Bcast,MPI_Allgather --msizes-list=8,1024,2048 --rep-prediction=min=1,max=200,step=5
-h
print help-v
print run-times measured for each process--msizes-list=<values>
list of comma-separated message sizes in Bytes, e.g.,--msizes-list=10,1024
--msize-interval=min=<min>,max=<max>,step=<step>
list of power of 2 message sizes as an interval between 2^min and 2^max, with 2^step distance between values, e.g.,--msize-interval=min=1,max=4,step=1
--calls-list=<args>
list of comma-separated MPI calls to be benchmarked, e.g.,--calls-list=MPI_Bcast,MPI_Allgather
--root-proc=<process_id>
root node for collective operations--operation=<mpi_op>
MPI operation applied by collective operations (where applicable), e.g.,--operation=MPI_BOR
.Supported operations: MPI_BOR, MPI_BAND, MPI_LOR, MPI_LAND, MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD
--datatype=<mpi_type>
MPI datatype used by collective operations, e.g.,--datatype=MPI_CHAR
.Supported datatypes: MPI_CHAR, MPI_INT, MPI_FLOAT, MPI_DOUBLE
--shuffle-jobs
shuffle experiments before running the benchmark--params=k1:v1,k2:v2
list of comma-separatedkey:value
pairs to be printed in the benchmark output.-f | --input-file=<path>
input file containing the list of benchmarking jobs (tuples of MPI function, message size, number of repetitions). It replaces all the other common options.
--window-size=<win>
window size in microseconds for Window-based synchronization
Specific options for synchronization methods based on a linear model of the clock drift
--fitpoints=<nfit>
number of fitpoints (default: 20)--exchanges=<nexc>
number of exchanges (default: 10)
--nrep=<nrep>
set number of experiment repetitions--summary=<args>
list of comma-separated data summarizing methods (mean, median, min, max), e.g.,--summary=mean,max
--rep-prediction=min=<min>,max=<max>,step=<step>
set the total number of repetitions to be estimated between<min>
and<max>
, so that at each iteration i, the number of measurements (nrep
) is eithernrep(0) = <min>
, ornrep(i) = nrep(i-1) + <step> * 2^(i-1)
, e.g.,--rep-prediction=min=1,max=4,step=1
--pred-method=m1,m2
comma-separated list of prediction methods, i.e., rse, cov_mean, cov_median (default: rse)--var-thres=thres1,thres2
comma-separated list of thresholds corresponding to the specified prediction methods (default: 0.01)--var-win=win1,win2
comma-separated list of (non-zero) windows corresponding to the specified prediction methods;rse
does not rely on a measurement window, however a dummy window value is required in this list when multiple methods are used (default: 10)
- MPI_Allgather
- MPI_Allreduce
- MPI_Alltoall
- MPI_Barrier
- MPI_Bcast
- MPI_Exscan
- MPI_Gather
- MPI_Reduce
- MPI_Reduce_scatter
- MPI_Reduce_scatter_block
- MPI_Scan
- MPI_Scatter
- GL_Allgather_as_Allreduce
- GL_Allgather_as_Alltoall
- GL_Allgather_as_GatherBcast
- GL_Allreduce_as_ReduceBcast
- GL_Allreduce_as_ReducescatterAllgather
- GL_Allreduce_as_ReducescatterblockAllgather
- GL_Bcast_as_ScatterAllgather
- GL_Gather_as_Allgather
- GL_Gather_as_Reduce
- GL_Reduce_as_Allreduce
- GL_Reduce_as_ReducescatterGather
- GL_Reduce_as_ReducescatterblockGather
- GL_Reduce_scatter_as_Allreduce
- GL_Reduce_scatter_as_ReduceScatterv
- GL_Reduce_scatter_block_as_ReduceScatter
- GL_Scan_as_ExscanReducelocal
- GL_Scatter_as_Bcast
This is the default synchronization method enabled for the benchmark.
To benchmark collective operations acorss multiple MPI libraries using the same barrier implementation, the benchmark provides a dissemination barrier that can replace the default MPI_Barrier to synchronize processes.
To enable the dissemination barrier, the following flag has to be set
before compiling the benchmark (e.g., using the ccmake
command).
ENABLE_BENCHMARK_BARRIER
Both barrier-based synchronization methods can alternatively use a double barrier before each measurement.
ENABLE_DOUBLE_BARRIER
The ReproMPI benchmark implements a window-based process synchronization mechanism, which estimates the clock offset/drift of each process relative to a reference process and then uses the obtained global clocks to synchronize processes before each measurement and to compute run-times.
It relies on one of the following clock synchronization methods:
- HCA synchronization: this is the clock synchronization algorithm we propose in []. It computes a linear model of the clock drift of each process. The HCA method can be configured by setting the following flags before compilation.
ENABLE_WINDOWSYNC_HCA ENABLE_LOGP_SYNC
The ENABLE_LOGP_SYNC
flag determines which variant of the HCA
algorithm is used, i.e., either HCA1 (which computes the clock models
in O(p) steps) or HCA2 (which requires only O(log p) rounds).
- SKaMPI synchronization: it implements the SKaMPI clock synchronization algorithm. To enable it, set the following flag before compilation.
ENABLE_WINDOWSYNC_SK
- Jones and Koenig synchronization: it implements the clock synchronization algorithm introduced by Jones and Koenig~[]. To enable it, set the following flag before compilation.
ENABLE_WINDOWSYNC_JK
The MPI operation run-time is computed in a different manner depending on the selected clock synchronization method. If global clocks are available, the run-times are computed as the difference between the largest exit time and the first start time among all processes.
If a barrier-based synchronization is used, the run-time of an MPI call is computed as the largest local run-time across all processes.
However, the timing proceduce that relies on global clocks can be used in combination with a barrier-based synchronization when the following flag is enabled:
ENABLE_GLOBAL_TIMES
More information regarding the timing procedure can be found in [].
The MPI_Wtime
cll is used by default to obtain the current time.
To obtain accurate measurements of short time intervals, the benchmark
can rely on the high resolution RDTSC/RDTSCP
instructions (if they are
available on the test machines) by setting on of the following flags:
ENABLE_RDTSC ENABLE_RDTSCP
Additionally, setting the clock frequency of the CPU is required to obtain accurate measurements:
FREQUENCY_MHZ 2300
The clock frequency can also be automatically estimated (as done by the NetGauge tool) by enabling the following variable:
CALIBRATE_RDTSC
However, this method reduces the results accuracy and we advise to
manually set the highest CPU frequency instead. More details about
the usage of RDTSC
-based timers can be found in our research
report[].
This is the full list of compilation flags that can be used to control all the previously detailed configuration parameters.
CALIBRATE_RDTSC OFF COMPILE_BENCH_TESTS OFF COMPILE_PRED_BENCHMARK ON COMPILE_SANITY_CHECK_TESTS OFF ENABLE_BENCHMARK_BARRIER OFF ENABLE_DOUBLE_BARRIER OFF ENABLE_GLOBAL_TIMES OFF ENABLE_LOGP_SYNC OFF ENABLE_RDTSC OFF ENABLE_RDTSCP OFF ENABLE_WINDOWSYNC_HCA OFF ENABLE_WINDOWSYNC_JK OFF ENABLE_WINDOWSYNC_SK OFF FREQUENCY_MHZ 2300