# Training on Jean Zay
See the wiki for more details.
## Run a training job
Note that there is no HTTP connection during a job.
You can debug using an interactive job. The following command will get you a new terminal with 1 GPU for 1 hour:

```shell
srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 --qos=qos_gpu-dev --pty bash -i
```
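Once the interactive shell opens, you can reproduce the batch environment by hand to debug the training command. This is a minimal sketch reusing the module and conda environment from the batch script shown below:

```shell
# Inside the interactive session: set up the same environment as the batch job.
module purge
module load anaconda-py3
conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/

# Run the training command interactively to debug it.
teklia-dan train
```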
You should run the actual training using a passive/batch job:
- Run `sbatch train_dan.sh`.
- The `train_dan.sh` file should look like the example below.
```sh
#!/bin/bash
#SBATCH --constraint=v100-32g
#SBATCH --qos=qos_gpu-t4                            # QoS (t4 allows jobs up to 100h)
#SBATCH --job-name=dan_training                     # name of the job
#SBATCH --gres=gpu:1                                # number of GPUs per node
#SBATCH --cpus-per-task=10                          # number of cores per task
#SBATCH --hint=nomultithread                        # we get physical cores, not logical
#SBATCH --distribution=block:block                  # we pin the tasks on contiguous cores
#SBATCH --nodes=1                                   # number of nodes
#SBATCH --ntasks-per-node=1                         # number of MPI tasks per node
#SBATCH --time=99:00:00                             # max execution time
#SBATCH --output=dan_train_hugin_munin_page_%j.out  # output log file
#SBATCH --error=dan_train_hugin_munin_page_%j.err   # error log file

module purge                                        # purge modules inherited by default
module load anaconda-py3
conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/

# print started commands
set -x

# execution
teklia-dan train
```
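Submitting the script prints the job ID, which is the value substituted for `%j` in the log file names. For example (the job ID below is illustrative):

```shell
$ sbatch train_dan.sh
Submitted batch job 1762916
```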
## Train on multiple GPUs
To train on multiple GPUs, one needs to update the parameters in the training configuration file, as detailed in the dedicated page. In addition, the number of GPUs required must be specified in the `train_dan.sh` file by updating the following line:

```sh
#SBATCH --gres=gpu:<nb_gpus>                        # number of GPUs per node
```
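As a minimal sketch, a single-node job requesting 4 GPUs (an arbitrary example value) could use the header below. Whether `--ntasks-per-node` must match the GPU count depends on how the distributed training is configured, so check the dedicated configuration page:

```sh
# Hypothetical example: reserve 4 GPUs on a single node.
# Adjust --ntasks-per-node to match your distributed setup if needed.
#SBATCH --gres=gpu:4                                # number of GPUs per node
#SBATCH --nodes=1                                   # number of nodes
```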
## Supervise a job
- Use `squeue -u $USER`. This command should give an output similar to the one presented below.
```shell
(base) [ubz97wr@jean-zay1: ubz97wr]$ squeue -u $USER
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1762916   gpu_p13 pylaia_t  ubz97wr  R   23:07:54      1 r7i6n1
 1762954   gpu_p13 pylaia_t  ubz97wr  R   22:35:57      1 r7i3n1
```
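To follow the training output itself, you can tail the log files declared in the batch script; `<jobid>` below stands for the ID reported by `squeue`:

```shell
# Follow the live output of a running job (file name set by --output above).
tail -f dan_train_hugin_munin_page_<jobid>.out
```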