Monitoring a node during a job.

You have ssh access to nodes in your running job(s) . A couple basic monitoring tools are demonstrated in the example transcript here. Screen shots are appended so that you can see the output from the tools.

[arnoldg@dt-login03 python]$ squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1214412 gpuA40x4- interact  arnoldg  R       8:14      1 gpub045
[arnoldg@dt-login03 python]$ ssh gpub045
gpub045.delta.internal.ncsa.edu (141.142.145.145)
  OS: RedHat 8.4   HW: HPE   CPU: 64x    RAM: 252 GB
Last login: Wed Dec 14 09:45:26 2022 from 141.142.144.42
[arnoldg@gpub045 ~]$ nvidia-smi

[arnoldg@gpub045 ~]$ module load nvtop
----------------------------------------------------------------------------------------------------------
The following dependent module(s) are not currently loaded: cuda/11.6.1 (required by: ucx/1.11.2, openmpi/4.1.2)
----------------------------------------------------------------------------------------------------------

The following have been reloaded with a version change:
  1) cuda/11.6.1 => cuda/11.7.0

[arnoldg@gpub045 ~]$ nvtop

[arnoldg@gpub045 ~]$ module load anaconda3_gpu
[arnoldg@gpub045 ~]$ nvitop

[arnoldg@gpub045 ~]$ top -u $USER

Page tree

Monitoring a node during a job.