Jean Zay: CUDA-aware MPI and GPUDirect
For optimal performance, CUDA-aware OpenMPI libraries with GPUDirect support are available on Jean Zay.
These MPI libraries can carry out communications using send and receive buffers allocated in GPU memory. Thanks to the GPUDirect support, these transfers are done directly from GPU to GPU whenever possible, without an intermediate copy in CPU memory.
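As an illustration, the minimal sketch below (not taken from the IDRIS examples) shows what this looks like in a C code: the buffers d_send and d_recv are allocated in GPU memory with cudaMalloc and their device pointers are passed directly to MPI_Sendrecv. The buffer names, sizes and exchange pattern are arbitrary.

/* Minimal sketch (illustrative): each MPI process exchanges a buffer
   allocated in GPU memory with a partner rank. With a CUDA-aware MPI
   library and GPUDirect, the device pointers can be passed directly
   to MPI_Sendrecv without a staging copy through CPU memory. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char* argv[])
{
  const int n = 1024;                       /* arbitrary buffer size */
  int rank, size, partner;
  double *d_send, *d_recv;                  /* buffers in GPU memory */

  /* Bind each MPI process to a GPU BEFORE MPI_Init
     (see the "Adaptation of the code" section below) */
  char* local_rank_env = getenv("SLURM_LOCALID");
  if (local_rank_env) cudaSetDevice(atoi(local_rank_env));

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  cudaMalloc((void**)&d_send, n * sizeof(double));
  cudaMalloc((void**)&d_recv, n * sizeof(double));

  /* Exchange the GPU buffers with a partner rank */
  partner = (rank + 1) % size;
  MPI_Sendrecv(d_send, n, MPI_DOUBLE, partner, 0,
               d_recv, n, MPI_DOUBLE, partner, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  cudaFree(d_send);
  cudaFree(d_recv);
  MPI_Finalize();
  return 0;
}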
Warning: GPUDirect cannot be used on Jean Zay for codes that use multiple GPUs per MPI rank.
Code compilation
It is necessary to compile the code by using one of the CUDA-aware MPI libraries available on Jean Zay.
After loading the compiler you want to use, one of the following modules should be loaded:
$ module avail openmpi/*-cuda
-------------- /gpfslocalsup/pub/modules-idris-env4/modulefiles/linux-rhel8-skylake_avx512 --------------
openmpi/3.1.4-cuda  openmpi/3.1.6-cuda  openmpi/4.0.2-cuda  openmpi/4.0.4-cuda

$ module load openmpi/4.0.4-cuda
If OpenMPI has not yet been compiled for the chosen compiler, an error message is displayed. In this case, please contact the support team, which can install a new version.
You can use the module show openmpi/<version> command to display the list of compilers for which OpenMPI has already been compiled. For example:
$ module show openmpi/4.0.2-cuda
------------------------------------------------------------------
/gpfslocalsup/pub/modules-idris-env4/modulefiles/linux-rhel8-skylake_avx512/openmpi/4.0.2-cuda:

module-whatis   {An open source Message Passing Interface implementation.}
prereq          intel-compilers/19.0.4 pgi/20.1 pgi/19.10 gcc/10.1.0 gcc/8.3.1
conflict        openmpi
conflict        intel-mpi

Available software environment(s):
- intel-compilers/19.0.4
- pgi/20.1
- pgi/19.10
- gcc/10.1.0
- gcc/8.3.1

If you want to use this module with another software environment,
please contact the support team.
-------------------------------------------------------------------
The compilation is done by using OpenMPI wrappers:
$ mpifort source.f90
$ mpicc source.c
$ mpic++ source.C
No particular option is needed at compilation for the MPI part. For more information on compiling code for the GPUs, you may refer to the GPU compilation section of the index.
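For example, with the PGI compilers, an OpenACC code could be compiled through the wrapper as shown below; the -acc and -ta=tesla:cc70 options are the usual PGI flags for OpenACC on the V100 GPUs and are given here only as an indication (refer to the GPU compilation pages for the recommended options):

$ mpifort -acc -ta=tesla:cc70 source.f90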
Adaptation of the code
Using the CUDA-aware MPI and GPUDirect functionality on Jean Zay requires a precise initialisation order in the code for CUDA or OpenACC and for MPI:
- Initialisation of CUDA or OpenACC
- Choice of the GPU which each MPI process should use (binding step)
- Initialisation of MPI.
Caution: if this initialisation order is not respected, your code execution might crash with the following error:
CUDA failure: cuCtxGetDevice()
Therefore, a slight adjustment of your code might be needed in order to use this functionality on Jean Zay.
Below you will find an example subroutine, in Fortran and in C, which performs the OpenACC initialisation before the MPI initialisation, followed by a CUDA example.
Important: This example only works when exactly one MPI process is assigned per GPU.
OpenACC example
Fortran version
- init_acc.f90
#ifdef _OPENACC
subroutine initialisation_openacc
  USE openacc

  character(len=6) :: local_rank_env
  integer          :: local_rank_env_status, local_rank

  ! Initialisation of OpenACC
  !$acc init

  ! Recovery of the local rank of the process via the environment variable
  ! set by Slurm, as MPI_Comm_rank cannot be used here because this routine
  ! is used BEFORE the initialisation of MPI
  call get_environment_variable(name="SLURM_LOCALID", value=local_rank_env, status=local_rank_env_status)

  if (local_rank_env_status == 0) then
    read(local_rank_env, *) local_rank
    ! Define the GPU to use via OpenACC
    call acc_set_device_num(local_rank, acc_get_device_type())
  else
    print *, "Error: impossible to determine the local rank of the process"
    stop 1
  end if
end subroutine initialisation_openacc
#endif
Example of usage:
- init_acc_mpi.f90
! Initialise OpenACC...
#ifdef _OPENACC
call initialisation_openacc
#endif

! ... before initialising MPI
call mpi_init(code)
C Version
- init_acc.c
#ifdef _OPENACC
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

void initialisation_openacc()
{
  char* local_rank_env;
  int local_rank;

  /* Initialisation of OpenACC */
  #pragma acc init

  /* Recovery of the local rank of the process via the environment variable
     set by Slurm, as MPI_Comm_rank cannot be used here because this routine
     is used BEFORE the initialisation of MPI */
  local_rank_env = getenv("SLURM_LOCALID");
  if (local_rank_env) {
    local_rank = atoi(local_rank_env);
    /* Define the GPU to use via OpenACC */
    acc_set_device_num(local_rank, acc_get_device_type());
  } else {
    printf("Error: impossible to determine the local rank of the process\n");
    exit(1);
  }
}
#endif
Example of usage:
- init_acc_mpi.c
#ifdef _OPENACC
/* Initialise OpenACC... */
initialisation_openacc();
#endif

/* ... before initialising MPI */
MPI_Init(&argc, &argv);
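Once MPI and OpenACC are initialised, communications can use data that reside on the GPU. The sketch below (illustrative, not part of the IDRIS examples) shows one common way to do this in C: a host_data region exposes the device address of the array to the MPI call so that the transfer can benefit from GPUDirect. The function name, buffer and destination rank are placeholders.

#include <mpi.h>

/* Illustrative sketch: send a GPU-resident array to another rank. The array
   "buf" is assumed to be already present on the GPU, e.g. inside an
   acc data region opened by the caller. */
void send_from_gpu(double* buf, int n, int dest)
{
  /* host_data exposes the device address of buf to the MPI call, so the
     CUDA-aware library can transfer directly from GPU memory (GPUDirect) */
  #pragma acc host_data use_device(buf)
  {
    MPI_Send(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
  }
}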
CUDA example
- init_cuda.c
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>

void initialisation_cuda()
{
  char* local_rank_env;
  int local_rank;
  cudaError_t cudaRet;

  /* Recovery of the local rank of the process via the environment variable
     set by Slurm, as MPI_Comm_rank cannot be used here because this routine
     is used BEFORE the initialisation of MPI */
  local_rank_env = getenv("SLURM_LOCALID");
  if (local_rank_env) {
    local_rank = atoi(local_rank_env);
    /* Define the GPU to use for each MPI process */
    cudaRet = cudaSetDevice(local_rank);
    if (cudaRet != cudaSuccess) {
      printf("Error: cudaSetDevice has failed\n");
      exit(1);
    }
  } else {
    printf("Error: impossible to determine the local rank of the process\n");
    exit(1);
  }
}
Example of usage:
- init_cuda_mpi.c
/* Initialise CUDA... */
initialisation_cuda();

/* ... before initialising MPI */
MPI_Init(&argc, &argv);
Execution of the code
For the execution, make sure that you load (with the module command) the same MPI library as the one used to compile the code, then launch the execution with the srun command.
CUDA-aware MPI and GPUDirect support is then active by default, without any additional operation.
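As an illustration, a minimal submission script could look like the following sketch. The job name, resource values, module versions and executable name are placeholders to adapt to your case; the page referenced below gives the recommended options.

#!/bin/bash
#SBATCH --job-name=gpu_cuda_aware   # job name (placeholder)
#SBATCH --ntasks=4                  # 4 MPI processes
#SBATCH --ntasks-per-node=4         # all on the same node (placeholder)
#SBATCH --gres=gpu:4                # 4 GPUs, i.e. one GPU per MPI process
#SBATCH --cpus-per-task=10          # cores per MPI process (placeholder)
#SBATCH --hint=nomultithread        # physical cores only
#SBATCH --time=00:10:00             # maximum execution time (placeholder)

# Load the same compiler and MPI modules as for the compilation
module purge
module load pgi/20.1                # compiler used at compilation (example)
module load openmpi/4.0.4-cuda

# Execution with srun: CUDA-aware MPI and GPUDirect are active by default
srun ./my_gpu_code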
For more information about submission of multi-GPU jobs using MPI, consult the page Execution of a multi-GPU Cuda-aware MPI and GPU job in batch.