Training on Jean Zay

See the wiki for more details.

Run a training job

There is no HTTP connection during a job: compute nodes cannot reach the internet, so download datasets, dependencies, and pre-trained models from the front-end node before submitting.
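
For example, Python dependencies can be pre-fetched from the front-end node; requirements.txt and the cache directory below are only illustrations, not project files:

# run on the login node, which has internet access
pip download -r requirements.txt -d $WORK/pip_cache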

You can debug using an interactive job. The following command will open a new terminal with 1 GPU for 1 hour:

srun --ntasks=1 --cpus-per-task=40 --gres=gpu:1 --time=01:00:00 --qos=qos_gpu-dev --pty bash -i
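
Once the interactive shell opens, a quick sanity check (standard commands, not specific to this project) is to confirm that the GPU is visible and that the training environment activates:

nvidia-smi
module load anaconda-py3
conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/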

You should run the actual training using a passive/batch job:

  • Run sbatch train_dan.sh.

  • The train_dan.sh file should look like the example below.

#!/bin/bash
#SBATCH --constraint=v100-32g
#SBATCH --qos=qos_gpu-t4                # QoS (long jobs, up to 100h)
#SBATCH --job-name=dan_training         # name of the job
#SBATCH --gres=gpu:1                    # number of GPUs per node
#SBATCH --cpus-per-task=10              # number of cores per task
#SBATCH --hint=nomultithread            # use physical cores, not logical ones
#SBATCH --distribution=block:block      # we pin the tasks on contiguous cores
#SBATCH --nodes=1                       # number of nodes
#SBATCH --ntasks-per-node=1             # number of MPI tasks per node
#SBATCH --time=99:00:00                 # max exec time
#SBATCH --output=dan_train_hugin_munin_page_%j.out         # output log file
#SBATCH --error=dan_train_hugin_munin_page_%j.err          # error log file

module purge                            # purging modules inherited by default
module load anaconda-py3

conda activate /gpfswork/rech/rxm/ubz97wr/.conda/envs/dan/

# print started commands
set -x

# execution
teklia-dan train
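
Submitting the script with sbatch prints the job ID used by squeue and scancel below; the ID shown here is only an example:

$ sbatch train_dan.sh
Submitted batch job 1762916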

Train on multiple GPUs

To train on multiple GPUs, update the relevant parameters in the training configuration file, as detailed in the dedicated page. In addition, specify the number of GPUs required in the train_dan.sh file by updating the following line:

#SBATCH --gres=gpu:<nb_gpus>            # number of GPUs per node
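
For example, to request four GPUs on a single node (assuming the training configuration has been updated to match):

#SBATCH --gres=gpu:4                    # number of GPUs per node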

Supervise a job

  • Use squeue -u $USER to list your jobs. This command should produce output similar to the example below.

(base) [ubz97wr@jean-zay1: ubz97wr]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1762916   gpu_p13 pylaia_t  ubz97wr  R   23:07:54      1 r7i6n1
           1762954   gpu_p13 pylaia_t  ubz97wr  R   22:35:57      1 r7i3n1
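
You can also follow a job's logs in real time; with the --output pattern from the script above, %j is replaced by the job ID:

tail -f dan_train_hugin_munin_page_1762916.out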

Delete a job

  • Use scancel $JOBID to cancel a specific job, as in the example after this list.

  • Use scancel -u $USER to cancel all your jobs.
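
For example, to cancel the first job from the squeue output above:

scancel 1762916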