Python scripts for automated execution of GPU jobs
Automated scripts are available to users of Jean Zay for the execution of GPU jobs via the SLURM job manager. These scripts are designed to be used from a notebook opened on a front-end node, in order to execute jobs distributed over the GPU compute nodes.
The scripts are developed by the IDRIS support team and installed in all the PyTorch and TensorFlow modules.
Importing the functions:
from idr_pytools import gpu_jobs_submitter, display_slurm_queue, search_log
Submission of GPU jobs
The gpu_jobs_submitter script enables the submission of GPU jobs into the SLURM queue. It automates the creation of SLURM files which meet our requirements and submits the jobs for execution via the sbatch command.
The automatically created SLURM files are found in the Slurm folder. You can consult them to check the configuration.
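For example, a minimal call could look like the following sketch (the script name is only an illustration; the module version is taken from the example further below, and the available arguments are detailed next):

from idr_pytools import gpu_jobs_submitter

# Submit a single job on 1 GPU; 'python -u' is prepended automatically
# because the first word of the command is a .py file.
jobids = gpu_jobs_submitter('my_script.py -b 64 -e 6 --learning-rate 0.001',
                            n_gpu=1,
                            module='tensorflow-gpu/py3/2.4.1')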
Arguments:
- srun_commands (required): The command to execute with srun; for AI, this is often a Python script. Example: 'my_script.py -b 64 -e 6 --learning-rate 0.001'. If the first word is a file with a .py extension, the python -u command is automatically added before the script name. It is also possible to indicate a list of commands in order to submit more than one job. Example: ['my_script.py -b 64 -e 6 --learning-rate 0.001', 'my_script.py -b 128 -e 12 --learning-rate 0.01'].
- n_gpu: The number of GPUs to reserve for a job. By default 1 GPU, with a maximum of 512 GPUs. It is also possible to indicate a list of numbers of GPUs. Example: n_gpu=[1, 2, 4, 8, 16]. In this way, a job will be created for each element of the list. If more than one command is specified in the preceding srun_commands argument, each command will be run on all of the requested configurations.
- module (required if using the modules): Name of the module to load. Only one module name is authorised.
- singularity (required if using a Singularity container): Name of the SIF image to load. The idrcontmgr command must have been applied beforehand. See the documentation on using Singularity containers.
- name: Name of the job. It is displayed in the SLURM queue and integrated into the log names. By default, the name of the Python script indicated in srun_commands is used.
- n_gpu_per_task: The number of GPUs associated with a task. By default, 1 GPU per task, in accordance with the data parallelism configuration. However, for model parallelism or the TensorFlow distribution strategies, it is necessary to associate more than one GPU with a task.
- time_max: The maximum duration of the job. By default: '02:00:00'.
- qos: The default QoS is 'qos_gpu-t3'. To use another QoS, specify 'qos_gpu-t4' or 'qos_gpu-dev'.
- partition: The default partition is 'gpu_p13'. To use another partition, specify 'gpu_p2', 'gpu_p2l' or 'gpu_p2s'.
- constraint: 'v100-32g' or 'v100-16g'. On the default partition, this allows forcing the use of the 32 GB GPUs or the 16 GB GPUs.
- cpus_per_task: The number of CPUs to associate with each task. By default: 10 for the default partition, or 3 for the gpu_p2 partition. It is advised to keep the default values.
- exclusive: Forces the exclusive use of a node.
- account: The GPU hour accounting to use. Required if you have more than one hour accounting and/or more than one project. For more information, you can refer to our documentation about project hours management.
- verbose: 0 by default. The value 1 adds NVIDIA debugging traces to the logs.
- email: The email address to which SLURM automatically sends job status reports.
- addon: Enables adding additional command lines to the SLURM file, for example 'unset PROXY', or to load a personal environment:
addon="""source .bashrc
conda activate myEnv"""
Return:
- jobids: List of the jobids of submitted jobs.
Note for A100:
- To use the A100 80GB partition with your xxx@a100 account, you just need to specify it with account=xxx@a100. The constraint and the module necessary to use this partition will then be automatically added to the generated .slurm file.
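For instance, an A100 submission could be sketched as follows (as above, 'xxx' stands for your own project name; the script and module names are only illustrations):

# Hypothetical example: replace xxx with your own project accounting name.
jobids = gpu_jobs_submitter('my_script.py -b 128 -e 12 --learning-rate 0.01',
                            n_gpu=8,
                            module='pytorch-gpu/py3/1.11.0',
                            account='xxx@a100')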
Example
- Command launched:
jobids = gpu_jobs_submitter(['my_script.py -b 64 -e 6 --learning-rate 0.001', 'my_script.py -b 128 -e 12 --learning-rate 0.01'], n_gpu=[1, 2, 4, 8, 16, 32, 64], module='tensorflow-gpu/py3/2.4.1', name="Imagenet_resnet101")
- Display:
batch job 0: 1 GPUs distributed on 1 nodes with 1 tasks / 1 gpus per node and 3 cpus per task
Submitted batch job 778296
Submitted batch job 778297
batch job 2: 2 GPUs distributed on 1 nodes with 2 tasks / 2 gpus per node and 3 cpus per task
Submitted batch job 778299
Submitted batch job 778300
batch job 4: 4 GPUs distributed on 1 nodes with 4 tasks / 4 gpus per node and 3 cpus per task
Submitted batch job 778301
Submitted batch job 778302
batch job 6: 8 GPUs distributed on 1 nodes with 8 tasks / 8 gpus per node and 3 cpus per task
Submitted batch job 778304
Submitted batch job 778305
batch job 8: 16 GPUs distributed on 2 nodes with 8 tasks / 8 gpus per node and 3 cpus per task
Submitted batch job 778306
Submitted batch job 778307
batch job 10: 32 GPUs distributed on 4 nodes with 8 tasks / 8 gpus per node and 3 cpus per task
Submitted batch job 778308
Submitted batch job 778309
batch job 12: 64 GPUs distributed on 8 nodes with 8 tasks / 8 gpus per node and 3 cpus per task
Submitted batch job 778310
Submitted batch job 778312
Interactive display of the SLURM queue
It is possible to display the SLURM queue and the waiting jobs in a notebook with the following command:
!squeue -u $USER
However, this only displays the current status.
The display_slurm_queue function provides a dynamic display of the queue, refreshed every 5 seconds. The function returns only when the queue is empty, which is useful in a notebook to obtain sequential execution of the cells. If the jobs take too long, it is possible to stop the execution of the cell (without any impact on the SLURM queue) and take back control of the notebook.
Arguments:
- name: Enables a filter by job name. The queue will only display jobs with this name.
- timestep: Refresh delay. By default, 5 seconds.
Example
- Command run:
display_slurm_queue("Imagenet_resnet101")
- Display:
 JOBID  PARTITION  NAME      USER     ST  TIME  NODES  NODELIST(REASON)
778312  gpu_p2     Imagenet  ssos040  PD  0:00  8      (Priority)
778310  gpu_p2     Imagenet  ssos040  PD  0:00  8      (Priority)
778309  gpu_p2     Imagenet  ssos040  PD  0:00  4      (Priority)
778308  gpu_p2     Imagenet  ssos040  PD  0:00  4      (Priority)
778307  gpu_p2     Imagenet  ssos040  PD  0:00  2      (Priority)
778306  gpu_p2     Imagenet  ssos040  PD  0:00  2      (Priority)
778305  gpu_p2     Imagenet  ssos040  PD  0:00  1      (Priority)
778304  gpu_p2     Imagenet  ssos040  PD  0:00  1      (Priority)
778302  gpu_p2     Imagenet  ssos040  PD  0:00  1      (Priority)
778301  gpu_p2     Imagenet  ssos040  PD  0:00  1      (Priority)
778296  gpu_p2     Imagenet  ssos040  PD  0:00  1      (Resources)
778297  gpu_p2     Imagenet  ssos040  PD  0:00  1      (Resources)
778300  gpu_p2     Imagenet  ssos040  PD  0:00  1      (Resources)
778299  gpu_p2     Imagenet  ssos040  R   1:04  1      jean-zay-ia828
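In a notebook, this function is typically called right after the submission so that the following cells only execute once all the jobs have finished. A sketch of this workflow, reusing the names from the examples above:

jobids = gpu_jobs_submitter('my_script.py -b 64 -e 6 --learning-rate 0.001',
                            n_gpu=4,
                            module='tensorflow-gpu/py3/2.4.1',
                            name="Imagenet_resnet101")
display_slurm_queue("Imagenet_resnet101")   # returns only when the queue is empty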
Search log paths
The search_log function enables finding the paths to the log files of the jobs executed with the gpu_jobs_submitter function.
The log files have names with the following format: '{name}@JZ_{datetime}_{ntasks}tasks_{nnodes}nodes_{jobid}'.
Arguments:
- name: Enables filtering by job name.
- contains: Enables filtering by date, number of tasks, number of nodes or jobid. The character '*' enables concatenating more than one filter. Example: contains='2021-02-12_22:*1node'.
- with_err: By default, False. If True, returns a dictionary with the paths of both the output files and the error files, listed in chronological order. If False, returns a list with only the paths of the output files, listed in chronological order.
Example:
- Command run:
paths = search_log("Imagenet_resnet101")
- Display:
['./slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:23:46_8tasks_4nodes_778096.out',
 './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:23:49_8tasks_4nodes_778097.out',
 './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:23:53_8tasks_4nodes_778099.out',
 './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:23:57_8tasks_8nodes_778102.out',
 './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:24:04_8tasks_8nodes_778105.out',
 './slurm/log/Imagenet_resnet101@JZ_2021-04-01_11:24:10_8tasks_8nodes_778110.out',
 './slurm/log/Imagenet_resnet101@JZ_2021-04-07_17:53:49_2tasks_1nodes_778310.out',
 './slurm/log/Imagenet_resnet101@JZ_2021-04-07_17:53:52_2tasks_1nodes_778312.out']
- Comment: The paths are listed in chronological order. If you want the last 2 paths, you simply need to use the following command:
paths[-2:]
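For example, to read the contents of the most recent output log, or to also retrieve the error logs with with_err=True, a simple sketch (the filter value passed to contains is only an illustration):

paths = search_log("Imagenet_resnet101")
with open(paths[-1]) as f:        # most recent output log
    print(f.read())

# Also retrieve the error log paths, filtered by date and number of nodes.
logs = search_log("Imagenet_resnet101", contains='2021-04-07*1nodes', with_err=True)
print(logs)                       # dictionary with both output and error log paths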