Monitoring jobs on the HEC

Viewing job state, or: Is my job running yet?

The most basic type of job monitoring is to check the job's current state, i.e. is it running, completed or still pending?

This can be done with the command squeue --me, which produces output similar to this example:

JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
  930    serial     myjob testuser  R    1:13:19      1 comp17-08
  931    serial     myjob testuser  R       0:02      1 comp17-08
  932    serial     myjob testuser PD       0:00      1 (Resources)

The output columns are fairly self-explanatory:

    • JOBID: A number used to uniquely identify your job within the job scheduling system. Use this number when you want to delete or cancel a job via the scancel command.

    • PARTITION: The name of the partition (queue) the job was submitted to.

    • NAME: The job's name, as specified by the job submission script's -J directive, or the filename of the job script if a name wasn't assigned.

    • USER: The user (account) name of the job owner.

    • ST: The state of the job, e.g. R for running or PD for pending.

    • TIME: The amount of time the job has been running for, in the format hh:mm:ss.

    • NODES: The number of compute nodes requested for a parallel job. Serial jobs, or parallel jobs that don't use all cores on a compute node, will show a value of 1.

    • NODELIST(REASON): For running jobs this is the list of compute nodes the job is running on. For pending jobs this gives the reason for the pending status (e.g. Resources if the cluster is busy, or Priority if other pending jobs have a higher priority to run).

The default action for squeue is to output basic information on all jobs. The list of all users’ jobs is usually very long!
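
A few standard squeue filters help narrow that list down; for example (a sketch, with testuser and the partition name purely illustrative):

squeue -u testuser     # jobs belonging to a single user (squeue --me is equivalent for your own)
squeue -p serial       # jobs in the serial partition only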

Job Lifecycle

A job’s lifecycle can be tracked via the ST (state) field in the output from squeue. All jobs start with a status of PD (pending). If the cluster is busy, or the job has requested a resource which is currently fully utilised, then a job may spend some time in this state.

Once an appropriate job slot is available, the job’s status changes to R (running). A job just finishing will show as state CG (completing) but this state should not last for more than a few seconds. When a job no longer appears on the squeue output, it has either finished or has been deleted.
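
To follow a particular job through these states, squeue can also be pointed at a single job ID or filtered by state (a sketch using standard squeue options; 1234 is an illustrative job ID):

squeue -j 1234          # show just this job
squeue --me -t PD       # show only your pending jobs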

By default, jobs will create two output files containing the output that would normally appear on the screen when run in non-batch mode, corresponding to the Linux stdout and stderr channels. Output file names are a combination of the job name, the job ID and the I/O channel name. So a job named mytest submitted as job ID 1234 would create output files named mytest.1234.out and mytest.1234.err.
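
If you prefer different names, the standard SLURM output directives can be added to the job submission script; as a sketch (using the usual -o and -e directives, where %x expands to the job name and %j to the job ID):

#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err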

Job output files are created once the job starts running, though they will initially be empty and will slowly fill with output.

Note

The job output files will collect any output not collected by other means. Some applications may save their output into separate files - check the user guide for your application to see how it behaves.

Note

All files written to in a batch job use buffering in order to make efficient use of the file system. This means that you won’t see any output in your files straight away. Once there’s 8 kilobytes of output - or once a process within your job finishes - the buffer contents will be written to the relevant output file.
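
If you need to see output sooner, a common workaround is to line-buffer a program's output with the standard coreutils stdbuf tool; as a sketch (myprogram is a placeholder for your application):

stdbuf -oL ./myprogram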

Tip

If squeue doesn't show a job you expect to be in the queue, check to see if the job's output files exist. For very short jobs (including those that stop quickly due to an error) it's common for the job to transition from pending to running to completed before you've had a chance to view it with squeue.

Email notification of job completion

Rather than repeatedly running squeue to check the state of your jobs, you can opt to receive email notification when your jobs complete by adding the following options to your job submission command:

--mail-type=END,FAIL --mail-user=youraddress@lancaster.ac.uk

Alternatively, you can add the following lines to your job submission script:

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=youraddress@lancaster.ac.uk
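
Placed in context, a minimal job script with notification enabled might look like the following sketch (the partition, job name and program name are illustrative):

#!/bin/bash
#SBATCH -p serial
#SBATCH -J mytest
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=youraddress@lancaster.ac.uk

./myprogram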

The email will contain a summary of the resources used by your job, for example:

Job ID: 1140
Cluster: hec-main
User/Group: testuser/local
State: COMPLETED (exit code 0)
Nodes: 2
Cores per node: 16
CPU Utilized: 00:00:30
CPU Efficiency: 3.23% of 00:15:28 core-walltime
Job Wall-clock time: 00:00:29
Memory Utilized: 3.54 MB
Memory Efficiency: 0.00% of 121.09 GB

Email notification for job arrays

When applied to job arrays, the mailback option would result in a notification for every completed array element - so a 10,000 element job array will result in 10,000 email notifications. To prevent overloading the mail system, job arrays with the mailback option set will be rejected at submission time.

If you’d like to be notified when a job array finishes, create a dummy job (i.e. one which does very little work) with the email notification commands above, and make it dependent on the completion of the job array by adding the command line arguments -d jobid to sbatch, where jobid is the job ID of the job array. This will cause the dummy job to wait until all elements of the specified job array have finished before it runs - it will then run for a few seconds, complete, and email you.
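
As a sketch, the dummy job might look like this (notify.sb is a hypothetical file name, and 12345 stands in for the job array's ID):

#!/bin/bash
#SBATCH -p serial
#SBATCH -J array-done
#SBATCH --mail-type=END
#SBATCH --mail-user=youraddress@lancaster.ac.uk

sleep 10

submitted with:

sbatch -d 12345 notify.sb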

Viewing your job resource quota with qquota

To ensure a fair share of the cluster, each user is capped by a set of resource quotas implemented using SLURM's QoS (Quality of Service) feature. Jobs submitted to the cluster are eligible to run provided they don't cause the user's resource usage to exceed their current quota. Where starting a job would breach a resource quota, the job is held until the user's resource usage has dropped enough to accommodate it - typically by waiting for other running jobs to complete.

Currently two resource quotas are enforced:

    • Job slots have a quota of 350 (i.e. a user may have running jobs consuming a total of up to 350 job slots or cores).

    • Memory usage is capped at a total of 1.64TB (i.e. users may have running jobs totalling up to 1.64TB of memory reservations, which with a job slot quota of 350 averages 4.8GB per job slot). Please refer to Running large memory jobs on the HEC for an explanation of job memory reservation requests.

Note

Users granted access to the HEC as an exception to the usual access policy will have smaller quotas than the examples given.

Resource quotas can be viewed using the qquota command:

wayland-2022% qquota

QOSname        Cores           Memory(GB)
 normal       64/350             512/1640

Note that if you haven’t run any jobs recently then the output will be blank, as no QoS record will exist for you.

Monitoring job efficiency

Job resource requests such as the number of cores or the amount of memory are reservations, much like reserving a table at a restaurant. That means the requested resources are reserved for that job whether or not the job makes full use of them. It’s important to keep tabs on your jobs’ resource usage to make sure that resource requests made by jobs are accurate; having a large number of inaccurate resource requests on the cluster will result in the cluster becoming starved of those resources for waiting jobs, even though the currently running jobs aren’t using them.

There are several tools to monitor how a job is - or has been - using resources, with different tools allowing for monitoring of jobs once they’ve completed or while they’re still running.

Monitoring of completed jobs

Summaries of completed jobs are stored in a database, which can be interrogated via the sacct command. The database structure is complex, so it's often best to view job summaries via helper scripts which use sacct under the bonnet, as described below. You can still access the sacct command directly - see man sacct for details on how it's used.
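
If you do want to query sacct directly, a short format list keeps the output manageable; for example (the field names are standard sacct columns, and 1168 is the job ID used in the examples below):

sacct -j 1168 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ReqMem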

Job resource usage summaries via seff

The job resource usage summary shown in the mailback notification above can also be generated at any time via the seff helper script. For example, for job ID 1168, which ran a serial (single-CPU) benchmark for the Yank free energy calculation framework, the command seff 1168 produces this output:

Job ID: 1168
Cluster: hec-main
User/Group: testuser/local
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:31:12
CPU Efficiency: 99.47% of 00:31:22 core-walltime
Job Wall-clock time: 00:31:22
Memory Utilized: 2.10 GB
Memory Efficiency: 42.07% of 5.00 GB

The output shows that CPU utilisation was very high (close to 100%), so good use was made of the requested CPU resource. Memory utilisation however was below 50%, suggesting that the job's memory resource request should have been lower. (Note that the job was manually stopped via the scancel command after half an hour, hence the job state of CANCELLED.)
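
A natural follow-up to a summary like this is to trim the job's memory request so it sits a little above the observed usage; as a sketch, assuming memory is requested with the standard --mem directive and allowing headroom above the 2.10 GB actually used:

#SBATCH --mem=3G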

Job summaries via qacct

The qacct command acts as a wrapper to sacct and extracts only key job information. Using the previous job as an example, we can run:

qacct -j 1168

Which produces the output:

JobID      1168
JobName    yank-serial.sb
Partition  serial
User       testuser
Submit     2022-12-12T11:07:08
Start      2022-12-12T11:07:08
End        2022-12-12T11:38:30
ExitCode   0:0
State      CANCELLED by testuser
AllocTRES  billing=1,cpu=1,mem=5G,node=1
NodeList   comp17-08

The output provides basic information such as the job name, submit, start and end timestamps, and the resources requested. Additional fields can be added using the -o option, which is passed on to the underlying call to sacct (see the sacct man page for details of the -o option). Note that qacct excludes information on job steps, so some fields may be empty.
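
For example, to pull in extra timing and memory fields (a sketch; the field names are standard sacct columns, and exactly how qacct merges them with its default output is down to the wrapper):

qacct -j 1168 -o Elapsed,MaxRSS,ReqMem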

Monitoring running jobs

As jobs can run for several hours or days, it’s useful to see how jobs are running - and what they’re running - in order to spot any potential problems in a job as early as possible. This is especially useful when running any new type of workload - either a new application, or a different model within an existing application. The commands qcgtop and qtop can help with this monitoring.

Monitoring jobs with qcgtop

The qcgtop command will show a summary of current CPU and memory usage for your jobs. Each SLURM job is managed by a Linux control group, which on a typical Linux desktop or server can be viewed via the systemd-cgtop command. The qcgtop command uses this information to provide a job resource usage summary.

Consider the following job output from squeue --me, which shows a 2-node parallel job running:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 1142  parallel imb-32wa testuser  R       0:03      2 comp17-[08-09]

The current amount of memory and CPU resource being consumed by the running job can be viewed via the command:

wayland-2022% qcgtop -u testuser

       Job   %CPU Memory
       ---   ---- ------
comp17-08
  job_1142 1592.8   1.3G
comp17-09
  job_1142 1590.8   1.3G

The output shows the CPU and memory utilisation of each job on each node. The CPU usage reported in this example is close to 1600%, which is the expected value for parallel jobs fully utilising all CPUs on a 16-core compute node.
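
For a long-running job it can be convenient to refresh this view automatically with the standard Linux watch command, for example every 30 seconds:

watch -n 30 qcgtop -u testuser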

Monitoring jobs with qtop

While the qcgtop tool described above provides an overall summary of each job's CPU and memory usage, it doesn't provide a breakdown of the individual processes within a job. The qtop tool can be used to view the individual processes within jobs - along with each process's memory and CPU utilisation. The drawback with qtop is that it isn't job-aware and will simply display each process being run on each compute node.

As an example of its usage, consider the following job list for user testuser, produced by running squeue --me:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 2286 serial    nbody2.s testuser  R       0:01      1 comp01-02
 2285 serial    nbody.sb testuser  R       0:16      1 comp01-01
 2284 serial    nbody.sb testuser  R       3:43      1 comp01-01

The output shows that testuser has three jobs running across two compute nodes: comp01-01 and comp01-02. The result of running qtop -u testuser looks like this:

Host: comp01-01
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1235678 testuser  20   0 7077748   1.2g 131936 R 100.0   0.6   3:46.31 nbody
1235915 testuser  20   0 7077748   1.2g 132180 R 100.0   0.6   0:19.46 nbody
1235961 testuser  20   0   50120   4508   3600 R   1.0   0.0   0:00.01 top
1235622 testuser  20   0   15268   3608   3156 S   0.0   0.0   0:00.00 slurm_s+
1235859 testuser  20   0   15268   3608   3156 S   0.0   0.0   0:00.00 slurm_s+
1235960 testuser  20   0  141276   5404   3912 S   0.0   0.0   0:00.00 sshd
Host: comp01-02
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1229929 testuser  20   0   10.8g   1.5g 954020 R  99.0   0.8   0:05.46 nbody
1229873 testuser  20   0   15268   3596   3140 S   0.0   0.0   0:00.00 slurm_s+
1229944 testuser  20   0  141276   5524   4032 S   0.0   0.0   0:00.00 sshd
1229945 testuser  20   0   50120   4476   3560 R   0.0   0.0   0:00.01 top

The output fields for processes are identical to those for the standard Linux top command executed in batch mode - see the top man page for an in-depth description of the meaning of each field. This description will cover only the more relevant fields. Processes are grouped so that all of a user's processes on a compute node appear together.

The first thing to note is that the information provided by qtop is very different from that of squeue. qtop is not an integrated part of SLURM so it will output process information from each compute node with a running job, rather than job information - a single job will involve executing a number of processes on a compute node. You’ll need to compare qtop and squeue output to work out just what’s going on. For example, qtop doesn’t give you the job-ID number, and it often lists two or more processes where squeue or qcgtop lists just one job.
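
One way to line the two outputs up is to ask squeue explicitly for the node list alongside each job ID (a sketch using standard squeue format codes, where %i is the job ID, %t the state and %N the node list):

squeue --me -o "%.8i %.10P %.12j %.2t %N"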

The three most relevant fields in the output are labelled COMMAND, RES and %CPU.

The COMMAND field shows the name of the command being run by the process. Because work is submitted to the cluster as a job script, the script itself becomes a process, named slurm_script (shortened to slurm_s+ in the above output). The slurm_script process typically consumes very little CPU - it simply sets up the job's working environment and then calls the applications requested in the job submission script.

As the qtop command runs the standard Linux top command on each compute node, this command will also appear in the list along with an ssh process (labelled sshd) which enables the remote command.

For most purposes, you’ll be interested in the remaining process(es) listed - typically the main process that your job script is currently running. In the above example the remaining processes are all called nbody - one of the applications available on the HEC and the main command in submitted job scripts.

The RES field gives the total resident memory size of each process. Smaller process sizes are listed in (k)ilobytes, larger ones in (m)egabytes, or even (g)igabytes.

The other useful field in the qtop output is %CPU, which describes how much of a single CPU the process is consuming. Typically a running serial job should be consuming very close to 100% of a CPU's resources. In contrast, an MPI parallel job will show multiple processes, each consuming around 100% CPU. OpenMP and other multi-threaded processes will show a single process entry consuming several hundred percent CPU - ideally 100% multiplied by the number of cores being used. Values considerably lower than these ideals likely indicate a problem: the process might be spending a disproportionate amount of time performing file reads or writes, or, in the case of a badly balanced parallel program, one process might be idle while waiting for a communication from another.

Note that the PID field gives the Linux process ID, not the SLURM Job ID. Each process on a Linux system is assigned a unique process ID, which forms part of the standard output for top.