
Benchmark for Gromacs 6vxx case
Benchmark description
System
The system consists of a protein in a box of water:
- Number of atoms:
  - Total = 932,310
  - Solvent = 887,139
  - Protein = 45,156
  - Counter ions = 15
- Box:
  - cubic, 21.14 nm
Simulation
The simulation is run with Gromacs 2020.1 and has the following characteristics:
- Type of simulation: molecular dynamics
- Ensemble: NVT
- Number of steps: 500,000
- Timestep: 2 fs
- Long-range electrostatics: Particle Mesh Ewald (PME)
- PME cutoff: 1.2 nm
- van der Waals cutoff: 1.1 nm
- cutoff-scheme: Verlet
- Temperature coupling: V-rescale
- Constraints: LINCS on H-bonds
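For reference, a minimal .mdp sketch matching these settings might look as follows. This is an assumption, not the actual benchmark input: the file name, temperature, coupling group and time constant are not given above.

```
; 6vxx_nvt.mdp -- hypothetical input reproducing the settings listed above
integrator           = md         ; molecular dynamics
nsteps               = 500000     ; 500,000 steps
dt                   = 0.002      ; 2 fs timestep
cutoff-scheme        = Verlet
coulombtype          = PME        ; Particle Mesh Ewald
rcoulomb             = 1.2        ; PME cutoff (nm)
rvdw                 = 1.1        ; van der Waals cutoff (nm)
tcoupl               = V-rescale  ; NVT ensemble
tc-grps              = System     ; assumption: single coupling group
tau-t                = 0.1        ; assumption (ps)
ref-t                = 300        ; assumption (K)
constraints          = h-bonds    ; LINCS on bonds involving hydrogen
constraint-algorithm = lincs
```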
Benchmark
The benchmark is performed using JUBE.
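The JUBE input files are not reproduced here; a typical command-line workflow with JUBE 2 might look like the following sketch, where the file name benchmark.xml and the output directory bench_run are assumptions:

```bash
# Launch the parameter sweep described in the (hypothetical) JUBE input file
jube run benchmark.xml

# Once all jobs have finished, collect and display the results
jube analyse bench_run
jube result bench_run
```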
Gromacs exposes a rather large number of tunable parameters when benchmarking, so it is plausible that a more finely tuned configuration would give better results. We tried to keep the setup as general as possible.
A first benchmark was performed with a small number of steps (50,000); the best configurations obtained were then re-run for 500,000 steps to reduce the impact of the slower initial steps.
CPU benchmark
The number of PME/PP tasks is kept at its default value. Note that no large imbalance occurred during these benchmarks; this is something you might want to check when doing your own tests.
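One way to check this is to look at the load-balancing summary that mdrun writes at the end of the log file, for example with the commands below (the exact wording of these lines may vary between Gromacs versions):

```bash
# Domain-decomposition load imbalance and PME/force load ratio
grep -i "load imbalance" 6vxx_nvt.log
grep -i "pme mesh/force" 6vxx_nvt.log
```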
Here is an example of the mdrun command for the thread-MPI version of Gromacs:
```bash
gmx mdrun -ntmpi 40 -ntomp 1 \
          -dlb yes -update auto -bonded auto \
          -nb auto -pme auto -pmefft auto \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000
```
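On a Slurm-based cluster, this run could be submitted roughly as in the sketch below; the walltime, module name and Slurm options are assumptions to adapt to your site:

```bash
#!/bin/bash
#SBATCH --job-name=6vxx_cpu
#SBATCH --nodes=1
#SBATCH --ntasks=1            # thread-MPI: a single Slurm task
#SBATCH --cpus-per-task=40    # 40 physical cores
#SBATCH --hint=nomultithread  # no hyperthreading
#SBATCH --time=03:00:00       # assumption

module load gromacs/2020.1    # assumption: module name is site-dependent

gmx mdrun -ntmpi 40 -ntomp 1 \
          -dlb yes -update auto -bonded auto \
          -nb auto -pme auto -pmefft auto \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000
```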
GPU benchmark
The GPU benchmark was run to find the best configuration for:
- 1 quarter of a node (1 GPU, 10 physical cores)
- 1 half node (2 GPUs, 20 physical cores)
- 1 node (4 GPUs, 40 physical cores)
The idea was to offload as much computation as possible to the GPU, so the following choices were made:
- Offload PME: the number of PME tasks then has to be set to 1 (-npme 1), since PME on GPU does not support domain decomposition.
- For the thread-MPI version, the newest features available since the 2020 version are also offloaded to the GPU (please read the corresponding Gromacs documentation before doing a production run):
  - update of the neighbor list
  - bonded interactions
The thread-MPI version is limited to a single node, i.e. a maximum of 4 GPUs. A large imbalance might therefore occur, but this remains the best choice for performance.
Here is an example of the mdrun command for the GPU thread-MPI version of Gromacs:
```bash
export GMX_GPU_PME_PP_COMMS=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=1
export GMX_GPU_DD_COMMS=true

gmx mdrun -ntmpi 5 -npme 1 -ntomp 4 \
          -dlb yes -update gpu -bonded gpu \
          -nb gpu -pme gpu -pmefft gpu \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000
```
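A corresponding Slurm submission for one full node (4 GPUs), using the best full-node configuration from the results below, might look like this sketch; the walltime, module name and GPU request syntax are assumptions to adapt to your site:

```bash
#!/bin/bash
#SBATCH --job-name=6vxx_gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1            # thread-MPI: a single Slurm task
#SBATCH --cpus-per-task=40    # 40 physical cores
#SBATCH --gres=gpu:4          # assumption: 4 GPUs requested this way
#SBATCH --hint=nomultithread
#SBATCH --time=02:00:00       # assumption

module load gromacs/2020.1    # assumption: module name is site-dependent

export GMX_GPU_PME_PP_COMMS=true
export GMX_FORCE_UPDATE_DEFAULT_GPU=1
export GMX_GPU_DD_COMMS=true

gmx mdrun -ntmpi 4 -npme 1 -ntomp 10 \
          -dlb yes -update gpu -bonded gpu \
          -nb gpu -pme gpu -pmefft gpu \
          -deffnm 6vxx_nvt -v \
          -nsteps 500000 -resetstep 300000
```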
Results
The performance is measured over the last 200,000 steps (hence the -resetstep 300000 option in the commands above).
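The figures below come from the performance summary that mdrun prints at the end of the log file; it can be extracted, for example, with the following command (the output layout may differ slightly between versions):

```bash
# Final timing summary: ns/day and hour/ns
grep "Performance:" 6vxx_nvt.log
```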
Benchmark of best results (500,000 steps)
A few indicators, such as the load imbalance and the PME/F ratio, are also reported.
The multithread column indicates whether hyperthreading was used during the computation.
nodes | ntasks | nthreads | multithread | ngpus | time (s) | perf (ns/day) | imbalance | pme/F |
---|---|---|---|---|---|---|---|---|
1 | 40 | 1 | nomultithread | 0 | 7082.424 | 4.88 | 0.6 | 0.88 |
1 | 5 | 4 | multithread | 1 | 2344.079 | 14.744 | 0.8 | 0.03 |
1 | 4 | 5 | nomultithread | 2 | 1478.937 | 23.368 | 2.2 | 0.02 |
1 | 4 | 10 | nomultithread | 4 | 851.949 | 40.566 | 0.2 | 0.03 |
Speedup relative to the CPU-only run:
Number of GPUs | perf (ns/day) | Speedup |
---|---|---|
0 | 4.88 | 1.0 |
1 | 14.744 | 3.0 |
2 | 23.368 | 4.8 |
4 | 40.566 | 8.3 |
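The speedup is simply the ratio of each performance figure to the CPU-only baseline, e.g. for the full node with 4 GPUs:

```bash
# 40.566 ns/day with 4 GPUs vs. 4.88 ns/day on CPU only
awk 'BEGIN { printf "%.1f\n", 40.566 / 4.88 }'   # prints 8.3
```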
First benchmark (50,000 steps)
A performance of -1.0 means that the run did not finish, most of the time because of a segmentation fault in the program.
Some indicators, such as the load imbalance and the PME/F ratio, are shown.
The multithread column indicates whether hyperthreading was used during the computation.
nodes | ntasks | nthreads | multithread | ngpus | time (s) | perf (ns/day) | imbalance | pme/F |
---|---|---|---|---|---|---|---|---|
1 | 5 | 8 | nomultithread | 0 | 489.09 | 3.533 | 0.5 | |
1 | 10 | 4 | nomultithread | 0 | 469.594 | 3.68 | 0.8 | |
1 | 8 | 5 | nomultithread | 0 | 437.289 | 3.952 | 0.4 | |
1 | 20 | 2 | nomultithread | 0 | 381.868 | 4.526 | 0.7 | 1.02 |
1 | 40 | 1 | nomultithread | 0 | 365.068 | 4.734 | 0.7 | 0.87 |
1 | 40 | 1 | nomultithread | 4 | | -1.0 | | |
1 | 40 | 2 | multithread | 4 | | -1.0 | | |
1 | 32 | 1 | nomultithread | 4 | | -1.0 | | |
1 | 32 | 2 | multithread | 4 | | -1.0 | | |
1 | 20 | 2 | nomultithread | 4 | | -1.0 | | |
1 | 20 | 4 | multithread | 4 | | -1.0 | | |
1 | 16 | 2 | nomultithread | 4 | | -1.0 | | 0.02 |
1 | 16 | 4 | multithread | 4 | | -1.0 | | 0.02 |
1 | 12 | 3 | nomultithread | 4 | | -1.0 | | 0.06 |
1 | 12 | 6 | multithread | 4 | | -1.0 | | 0.06 |
1 | 8 | 4 | nomultithread | 4 | | -1.0 | | 0.05 |
1 | 8 | 8 | multithread | 4 | | -1.0 | | 0.05 |
1 | 10 | 1 | nomultithread | 1 | | -1.0 | | 0.03 |
1 | 10 | 2 | multithread | 1 | | -1.0 | | 0.04 |
1 | 20 | 1 | nomultithread | 2 | | -1.0 | | |
1 | 20 | 2 | multithread | 2 | | -1.0 | | |
1 | 10 | 2 | nomultithread | 2 | | -1.0 | | 0.04 |
1 | 10 | 4 | multithread | 2 | | -1.0 | | 0.04 |
1 | 2 | 5 | nomultithread | 1 | 132.988 | 12.995 | ||
1 | 4 | 2 | nomultithread | 1 | 128.825 | 13.415 | 0.7 | 0.03 |
1 | 2 | 10 | multithread | 1 | 125.646 | 13.754 | ||
1 | 5 | 2 | nomultithread | 1 | 124.236 | 13.91 | 0.5 | 0.03 |
1 | 4 | 4 | multithread | 1 | 120.234 | 14.373 | 0.3 | 0.03 |
1 | 5 | 4 | multithread | 1 | 117.625 | 14.692 | 0.5 | 0.02 |
1 | 2 | 10 | nomultithread | 2 | 107.217 | 16.118 | ||
1 | 2 | 20 | multithread | 2 | 103.69 | 16.667 | ||
1 | 4 | 5 | nomultithread | 2 | 75.724 | 22.822 | 0.5 | 0.02 |
1 | 4 | 10 | multithread | 2 | 73.035 | 23.662 | 3.4 | 0.03 |
1 | 4 | 10 | nomultithread | 4 | 44.863 | 38.521 | 0.2 | 0.03 |