Jean Zay: Execution of an MPI multi-GPU job in batch
Jobs are managed on all of the nodes by the Slurm software.
To submit an MPI multi-GPU batch job, you must create a submission script by following the 2 examples below.
If your code requires the CUDA-aware MPI feature, refer to the CUDA-aware MPI and GPUDirect page.
- Example for an execution using 3 GPUs on the same node of the default gpu partition:
- mpi_multi_gpu.slurm
#!/bin/bash
#SBATCH --job-name=mpi_multi_gpu     # name of job
# Other partitions are usable by activating/uncommenting
# one of the 5 following directives:
##SBATCH -C v100-16g                 # uncomment to target only 16GB V100 GPUs
##SBATCH -C v100-32g                 # uncomment to target only 32GB V100 GPUs
##SBATCH --partition=gpu_p2          # uncomment for gpu_p2 partition (32GB V100 GPUs)
##SBATCH -C a100                     # uncomment for gpu_p5 partition (80GB A100 GPUs)
##SBATCH -C h100                     # uncomment for gpu_p6 partition (80GB H100 GPUs)
# Here, reservation of 3x10=30 CPUs (for 3 tasks) and 3 GPUs (1 GPU per task) on a single node:
#SBATCH --nodes=1                    # number of nodes
#SBATCH --ntasks-per-node=3          # number of MPI tasks per node (= number of GPUs per node)
#SBATCH --gres=gpu:3                 # number of GPUs per node (max 8 with gpu_p2, gpu_p5)
# The number of CPUs per task must be adapted according to the partition used. Knowing that here
# only one GPU is reserved per task (i.e. 1/4 or 1/8 of the GPUs of the node, depending on
# the partition), the ideal is to reserve 1/4 or 1/8 of the CPUs of the node for each task:
#SBATCH --cpus-per-task=10           # number of cores per task (1/4 of the 4-GPU V100 node here)
##SBATCH --cpus-per-task=3           # number of cores per task for gpu_p2 (1/8 of the 8-GPU V100 node)
##SBATCH --cpus-per-task=8           # number of cores per task for gpu_p5 (1/8 of the 8-GPU A100 node)
##SBATCH --cpus-per-task=24          # number of cores per task for gpu_p6 (1/4 of the 4-GPU H100 node)
# /!\ Caution, "multithread" in Slurm vocabulary refers to hyperthreading.
#SBATCH --hint=nomultithread         # hyperthreading deactivated
#SBATCH --time=00:10:00              # maximum execution time requested (HH:MM:SS)
#SBATCH --output=mpi_gpu_multi%j.out # name of output file
#SBATCH --error=mpi_gpu_multi%j.out  # name of error file (here, common with the output file)

# Clean out the modules loaded in interactive and inherited by default
module purge

# Uncomment the following module command if you are using the "gpu_p5" partition
# to have access to the modules compatible with this partition.
#module load arch/a100

# Uncomment the following module command if you are using the "gpu_p6" partition
# to have access to the modules compatible with this partition.
#module load arch/h100

# Load the modules
module load ...

# Echo of launched commands
set -x

# For the "gpu_p5" and "gpu_p6" partitions, the code must be compiled with the modules
# compatible with the chosen partition.

# Execute code with binding via bind_gpu.sh: 1 GPU per task
srun /gpfslocalsup/pub/idrtools/bind_gpu.sh ./mpi_multi_gpu_exe
To launch a Python script, replace the last line with:

# Execute code with binding via bind_gpu.sh: 1 GPU per task
srun /gpfslocalsup/pub/idrtools/bind_gpu.sh python -u mpi_multi_gpu.py

Comment: The Python option -u (= unbuffered) deactivates the buffering of the standard outputs, which Slurm performs automatically.
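If your Python code manages the task-to-GPU association itself instead of relying on bind_gpu.sh, one common pattern is to use Slurm's SLURM_LOCALID variable (the task's index on its node) to restrict each task to one device before any GPU framework is initialized. Below is a minimal sketch under that assumption; the function name select_local_gpu is illustrative, not part of any IDRIS tool:

```python
import os

def select_local_gpu(default="0"):
    """Restrict the current MPI task to a single GPU, using Slurm's
    per-node task index (SLURM_LOCALID ranges over 0..ntasks-per-node-1).
    Must be called before any CUDA context is created (i.e. before
    importing/initializing the GPU framework)."""
    local_id = os.environ.get("SLURM_LOCALID", default)
    os.environ["CUDA_VISIBLE_DEVICES"] = local_id
    return local_id

# Example: the task with SLURM_LOCALID=2 will only see physical GPU 2,
# exposed to the framework as device 0.
```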
- Another example, for an execution using 8 GPUs, i.e. 2 complete nodes of the default gpu partition:
- mpi_multi_gpu.slurm
#!/bin/bash
#SBATCH --job-name=mpi_multi_gpu     # name of job
# Other partitions are usable by activating/uncommenting
# one of the 5 following directives:
##SBATCH -C v100-16g                 # uncomment to target only 16GB V100 GPUs
##SBATCH -C v100-32g                 # uncomment to target only 32GB V100 GPUs
##SBATCH --partition=gpu_p2          # uncomment for gpu_p2 partition (32GB V100 GPUs)
##SBATCH -C a100                     # uncomment for gpu_p5 partition (80GB A100 GPUs)
##SBATCH -C h100                     # uncomment for gpu_p6 partition (80GB H100 GPUs)
# Here, reservation of 8x10=80 CPUs (4 tasks per node) and 8 GPUs (4 GPUs per node) on 2 nodes:
#SBATCH --ntasks=8                   # total number of MPI tasks (= total number of GPUs here)
#SBATCH --ntasks-per-node=4          # number of MPI tasks per node (= number of GPUs per node)
#SBATCH --gres=gpu:4                 # number of GPUs per node (max 8 with gpu_p2, gpu_p5)
# The number of CPUs per task must be adapted according to the partition used. Knowing that here
# only one GPU is reserved per task (i.e. 1/4 or 1/8 of the GPUs of the node, depending on
# the partition), the ideal is to reserve 1/4 or 1/8 of the CPUs of the node for each task:
#SBATCH --cpus-per-task=10           # number of cores per task (1/4 of the 4-GPU V100 node here)
##SBATCH --cpus-per-task=3           # number of cores per task for gpu_p2 (1/8 of the 8-GPU V100 node)
##SBATCH --cpus-per-task=8           # number of cores per task for gpu_p5 (1/8 of the 8-GPU A100 node)
##SBATCH --cpus-per-task=24          # number of cores per task for gpu_p6 (1/4 of the 4-GPU H100 node)
# /!\ Caution, "multithread" in Slurm vocabulary refers to hyperthreading.
#SBATCH --hint=nomultithread         # hyperthreading deactivated
#SBATCH --time=00:10:00              # maximum execution time requested (HH:MM:SS)
#SBATCH --output=mpi_gpu_multi%j.out # name of output file
#SBATCH --error=mpi_gpu_multi%j.out  # name of error file (here, common with the output file)

# Clean out the modules loaded in interactive and inherited by default
module purge

# Uncomment the following module command if you are using the "gpu_p5" partition
# to have access to the modules compatible with this partition.
#module load arch/a100

# Uncomment the following module command if you are using the "gpu_p6" partition
# to have access to the modules compatible with this partition.
#module load arch/h100

# Load the modules
module load ...

# Echo of launched commands
set -x

# For the "gpu_p5" and "gpu_p6" partitions, the code must be compiled with the modules
# compatible with the chosen partition.

# Execute code with binding via bind_gpu.sh: 1 GPU per task
srun /gpfslocalsup/pub/idrtools/bind_gpu.sh ./mpi_multi_gpu_exe
To launch a Python script, replace the last line with:

# Execute code with binding via bind_gpu.sh: 1 GPU per task
srun /gpfslocalsup/pub/idrtools/bind_gpu.sh python -u mpi_multi_gpu.py

Comment: The Python option -u (= unbuffered) deactivates the buffering of the standard outputs, which Slurm performs automatically.
Then submit this script via the sbatch command:
$ sbatch mpi_multi_gpu.slurm
Comments:
- Similarly, you can execute your job on the gpu_p2 partition by specifying --partition=gpu_p2 and --cpus-per-task=3.
- Similarly, you can execute your job on the gpu_p5 partition by specifying -C a100 and --cpus-per-task=8. Warning: the modules accessible by default are not compatible with this partition; you must first load the arch/a100 module to be able to list and load the compatible modules. For more information, see Modules compatible with gpu_p5 partition.
- Similarly, you can execute your job on the gpu_p6 partition by specifying -C h100 and --cpus-per-task=24. Warning: the modules accessible by default are not compatible with this partition; you must first load the arch/h100 module to be able to list and load the compatible modules. For more information, see Modules compatible with gpu_p6 partition.
- We recommend that you compile and execute your code in the same environment by loading the same modules.
- In this example, we assume that the mpi_multi_gpu_exe executable file is found in the submission directory, i.e. the directory in which the sbatch command is entered.
- By default, Slurm buffers the standard outputs of a Python script, which can result in a significant delay between the script execution and the output appearing in the logs. To deactivate this buffering, add the -u (= unbuffered) option to the python call. Alternatively, setting the PYTHONUNBUFFERED environment variable to 1 in your submission script (export PYTHONUNBUFFERED=1) has the same effect. This variable is set by default in the virtual environments installed on Jean Zay by the support team.
- The computation output file mpi_gpu_multi<numero_job>.out is also found in the submission directory. It is created at the very beginning of the job execution; editing or modifying it while the job is running can disrupt the execution.
- The module purge is made necessary by the Slurm default behaviour: any modules which are loaded in your environment at the moment when you launch sbatch are passed to the submitted job.
- To avoid errors from the automatic task allocation, we recommend using srun to execute your code instead of mpirun: this guarantees a distribution consistent with the resources requested in your submission file.
- The /gpfslocalsup/pub/idrtools/bind_gpu.sh script associates a different GPU with each MPI task. It is not necessary if your code explicitly manages the association of the MPI tasks to the GPUs. Be careful: for the moment, this script is basic and can only manage the simple case of 1 GPU per MPI task. It therefore only works if the number of MPI tasks per node is less than or equal to 4. If you have more complex requirements, please contact the IDRIS User Support Team.
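For illustration, the core of such a 1-GPU-per-task binding wrapper can be sketched in a few lines of shell. This is a simplified, hypothetical version based on Slurm's SLURM_LOCALID variable; the actual bind_gpu.sh provided by IDRIS may behave differently:

```shell
#!/bin/bash
# Simplified sketch of a 1-GPU-per-task binding wrapper (hypothetical;
# the real bind_gpu.sh may differ).
# SLURM_LOCALID is the index of the current task on its node (0, 1, 2, ...),
# so each task sees exactly one distinct GPU.
export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
# Replace the wrapper process by the real command passed as arguments,
# e.g. ./mpi_multi_gpu_exe or "python -u mpi_multi_gpu.py"
exec "$@"
```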
- All jobs have resources defined in Slurm per partition and per QoS (Quality of Service) by default. You can modify the limits by specifying another partition and/or QoS as shown in our documentation detailing the partitions and QoS.
- For multi-project users and those having both CPU and GPU hours, it is necessary to specify the project accounting (hours allocation of the project) on which to count the computing hours of the job as indicated in our documentation detailing the project hours management.
- We strongly recommend that you consult our documentation detailing the project hours management to ensure that the hours consumed by your jobs are deducted from the correct accounting.
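For example, the project accounting can be selected directly in the submission script with Slurm's --account directive. The accounting name below is a placeholder; the exact form to use for your project is given in the hours management documentation:

```shell
# Hypothetical example: charge this job to a specific project allocation.
# "my_project@v100" is a placeholder accounting name, not a real project.
#SBATCH --account=my_project@v100
```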