Monitoring jobs on the HEC
==========================

.. _squeue:

Viewing job state, or: Is my job running yet?
---------------------------------------------

The most basic type of job monitoring is to check the job's current
*state*, ie is it running, completed or still pending?

This can be done with the command ``squeue --me``  
and produces output similar to this emample:

.. code-block:: console

    JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
      930    serial     myjob testuser  R    1:13:19      1 comp17-08
      931    serial     myjob testuser  R       0:02      1 comp17-08
      932    serial     myjob testuser PD       0:00      1 (Resources)
  
The output columns are fairly self-explanatory:

.. list-table::

  * - JOBID
    - A number used to uniquely identify your job within the job scheduling system. Use this number when you want to delete or cancel a job via the ``scancel`` command.
  * - PARTITION
    - The name of the partition (queue) the job was submitted to
  * - NAME
    - The job's name, as specified by the job submission script's *-J* directive, or the filename of the job script if a name wasn't assigned
  * - USER
    - The user (account) name of the job owner
  * - ST
    - The state of the job, e.g. R for running or PD for pending
  * - TIME
    - The amount of time the job has been running for in the format hh:mm:ss
  * - NODE
    - The number of compute nodes requested for a parallel job. Serial jobs or parallel jobs that don't use all cores on a compute node will show a value of 1
  * - NODELIST(REASON)
    - For running jobs this is the list of compute nodes the job is running on. For pending jobs this will give the reason for the pending status (e.g. Resources if the cluster is busy, or Priority if other pending jobs have a higher priorty to run

The default action for ``squeue`` is to output basic information on all jobs. 
The list of all users' jobs is usually very long!

Job Lifecycle
^^^^^^^^^^^^^

A job's lifecycle can be tracked via the ST (state) field in the output 
from ``squeue``. All jobs start with a status of PD (pending). If the cluster 
is busy, or the job has requested a resource which is currently fully 
utilised, then a job may spend some time in this state.

Once an appropriate job slot is available, the job's status changes 
to R (running). A job just finishing will show as state CG (completing) 
but this state should not last for more than a few seconds. When a job 
no longer appears on the ``squeue`` output, it has either finished or has 
been deleted.

By default, jobs will create two output files containing the output that would
normally appear on the screen when run in non-batch mode, and corresponding
to the linux *stdout* and *stderr* channels. Output files names are a combination
of the *job name*, the *job id* and the I/O channel name. So a job named **mytest**
submitted as job id **1234** would create output files named **mytest.1234.out** and
**mytest.1234.err**

Job output files are created once the job starts running, though they will
initially be empty and will slowly fill with output. These files will be simple
text files and can viewed using standard file paging commands such as ``less``

.. note::

  The job output files will collect any output not collected by other means.
  Some applications may save their output into separate files - check the user
  guide for your application to see how it behaves.
  
.. note::

  All files written to in a batch job use *buffering* in order to make efficient use
  of the file system. This means that you won't see any output in your files straight away.
  Once there's 8 kilobytes of output - or once a *process* within your job finishes -
  the buffer contents will be written to the relevant output file.

.. tip::
  If ``squeue`` doesn't show a job you expect to be in the queue, check to see 
  if the job's output files exist. For very short jobs (including those 
  that stop quickly due an error) it's common for the job to transition 
  from pending to running to completed before you've had chance to view it 
  with ``squeue``.

.. _email:

Email notification of job completion
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Rather than repeatedly running ``squeue`` to check the state of your jobs, 
you can opt to receive email notification when your jobs complete by adding 
the following line to your job submission command:

.. code-block:: console

  --mail-type=END,FAIL --mail-user=youraddress@lancaster.ac.uk

Alternatively, you can add the following lines to your job submission script:

.. code-block:: bash

  #SBATCH --mail-type=END,FAIL
  #SBATCH --mail-user=youraddress@lancaster.ac.uk

The email will contain a summary of the resources used by your job, for example:

.. code-block:: console

  Job ID: 1140
  Cluster: hec-main
  User/Group: testuser/local
  State: COMPLETED (exit code 0)
  Nodes: 2
  Cores per node: 16
  CPU Utilized: 00:00:30
  CPU Efficiency: 3.23% of 00:15:28 core-walltime
  Job Wall-clock time: 00:00:29
  Memory Utilized: 3.54 MB
  Memory Efficiency: 0.00% of 121.09 GB

Email notification for job arrays
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When applied to :doc:`job arrays</jobarrays>`, the mailback option would result in a 
notification for every completed array element - so a 10,000 element 
job array will result in 10,000 email notifications. To prevent 
overloading the mail system, job arrays with the mailback option set 
will be rejected at submission time.

If you'd like to be notified when a job array finishes, create a dummy 
job (i.e. one which does very little work) with the email notification 
commands above, and make it dependent on the completion of the job array 
by adding the command line arguments ``-d jobid`` to sbatch, where *jobid* 
is the job ID of the job array. This will cause the dummy job to wait until 
all elements of the specified job array have finished before it runs - 
it will then run for a few seconds, complete, and email you.

.. _qquota:

Viewing your job resource quota with qquota
-------------------------------------------

To ensure a fair share of the cluster, each user is capped by a set of 
*resource quotas* implemented using SLURM's QoS (Quality of Service) 
feature. Jobs submitted to the cluster are eligible to run provided 
they don't cause the user's resource usage to exceed their current quota. 
In cases where job start would cause a resource quota to be breached the 
job is held waiting until the user resource usage has reduced by enough 
capacity to support it - typically by waiting for other running jobs to complete.

Currently two resource quotas are enforced:

**Job slots** have a quota of 512 (i.e. a user may have running jobs consuming 
a total of up to 512 job slots or cores)

**Memory usage** is capped at a total of 2TB (i.e. users may have running jobs 
totalling up to 2TB of memory reservations, which with a job slot quota of 512, 
averages 4GB per job slot). Please refer to :doc:`/largemem`
for an explanation of job memory reservation requests.

.. note::

   Users granted access to the HEC as an exception to the usual access policy will have
   smaller quotas than the examples given.

Resource quotas can be viewed using the ``qquota`` command:

.. code-block:: console

  wayland-2022% qquota

  QOSname        Cores           Memory(GB)
   normal       64/384             512/1844
    
Note that if you haven't run any jobs recently then the output will be blank, as 
no QoS record will exist for you.

Monitoring job efficiency
-------------------------

Job resource requests such as the number of cores or the amount of
memory are *reservations*, much like reserving a table at a
restaurant. That means the requested resources are reserved for that
job whether or not the job makes full use of them. It's important to
keep tabs on your jobs' resource usage to make sure that resource
requests made by jobs are accurate; having a large number of inaccurate resource
requests on the cluster will result in the cluster becoming starved of
those resources for waiting jobs, even though the currently running
jobs aren't using them.

There are several tools to monitor how a job is - or has been - using
resources, with different tools allowing for monitoring of jobs once
they've completed or while they're still running.

Monitoring of completed jobs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Summaries of completed jobs are stored in a database, which can be 
interrogated via the ``sacct`` command. The database structure is 
complex, so it's often best to view job summaries via helper scripts 
which use ``sacct`` under the bonnet as described below. You can
still access the ``sacct`` command directly - see ``man sacct`` for
details on how its used.

Job resource usage summaries via seff
#####################################

The job resource usage summary shown in the mailback notification 
for job completion above can also be run at any time via the ``seff`` 
helper script. E.g. for job ID 1168, which runs a serial (single CPU) 
benchmark for the Yank free energy calculation framework the 
command ``seff 1168`` produces this output from a serial job:

.. code-block:: console

  Job ID: 1168
  Cluster: hec-main
  User/Group: testuser/local
  State: CANCELLED (exit code 0)
  Cores: 1
  CPU Utilized: 00:31:12
  CPU Efficiency: 99.47% of 00:31:22 core-walltime
  Job Wall-clock time: 00:31:22
  Memory Utilized: 2.10 GB
  Memory Efficiency: 42.07% of 5.00 GB

The output shows that CPU utilisation was very high (close to 100%), 
so good use was made of the requested CPU resource. 
Memory utilisation however was below 50%, suggesting that the job's 
memory resource request should be have been lower. (Note that the job was 
manually stopped via the scancel command after half an hour, hence 
the job state of CANCELLED).

.. _qacct:

Job summaries via qacct
#######################

The ``qacct`` command acts as a wrapper to ``sacct`` and extracts 
only key job information. Using the previous job as an example, 
we can run:

.. code-block:: console

  qacct -j 1168

Which produces the output:

.. code-block:: console

  JobID      1168
  JobName    yank-serial.sb
  Partition  serial
  User       testuser
  Submit     2022-12-12T11:07:08
  Start      2022-12-12T11:07:08
  End        2022-12-12T11:38:30
  ExitCode   0:0
  State      CANCELLED by testuser
  AllocTRES  billing=1,cpu=1,mem=5G,node=1
  NodeList   comp17-08
  
The output provides basic information such as the job name, submit-,
start- and end-timestamps, and the resources requested.  Additional
fields can be added using the ``-o`` option which is passed on to the
underlying call to ``sacct`` (see the ``sacct`` man page for details
of the ``-o`` option. Note that the ``qacct`` excludes information on
job steps, so some fields may be empty).

Monitoring running jobs
^^^^^^^^^^^^^^^^^^^^^^^

As jobs can run for several hours or days, it's useful to see how jobs
are running - and what they're running - in order to spot any potential
problems in a job as early as possible. This is especially useful when
running any new type of workload - either a new application, or a
different model within an existing application. The commands ``qcgtop``
and ``qtop`` can help with this monitoring.

.. _qcgtop:

Monitoring jobs with qcgtop
###########################

The ``qcgtop`` command will show a summary of current CPU and memory
usage for your jobs. Each Slurm job is managed by a Linux *control
group*, which on a typical Linux desktop or server can be viewed via
the ``systemd-cgtop`` command. The ``qcgtop`` command uses this
information or provide a job resource usage summary.

Consider the following job output from ``squeue --me``, which shows a 
2-node parallel job running:

.. code-block:: console

      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
       1142  parallel imb-32wa testuser  R       0:03      2 comp17-[08-09]

The current amount of memory and CPU resource being consumed by the 
running job can be viewed vie the command

.. code-block:: console

  wayland-2022% qcgtop -u testuser

         Job   %CPU Memory
         ---   ---- ------
  comp17-08
    job_1142 1592.8   1.3G 
  comp17-09
    job_1142 1590.8   1.3G

The output shows the CPU and memory utilisation of each job on each
node.  The CPU usage reported in this example is close to 1600%, which
is the expected value for parallel jobs fully utilising all CPUs on a
16-core compute node.

.. _qtop:

Monitoring jobs with qtop
#########################

While the ``qcgtop`` tool described above provides an overall summary
of each jobs' CPU and memory usage, it doesn't provide a breakdown of
the individual processes within a job. The ``qtop`` tool can be used
to view the individual processes within jobs - along with each process'
memory and CPU utilisation. The drawback with ``qtop`` is that it
isn't job-aware and will simply display each process being run on each
compute node.

As an example of its usage, consider the following job list for user
*testuser* running the command ``squeue --me`` to view their jobs:

.. code-block:: console

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2286 serial    nbody2.s testuser  R       0:01      1 comp01-02
              2285 serial    nbody.sb testuser  R       0:16      1 comp01-01
              2284 serial    nbody.sb testuser  R       3:43      1 comp01-01

The output shows that *testuser* has three jobs running across two compute nodes:
*comp01-01* and *comp01-02*. The result of running ``qtop -u testuser`` looks like
this:

.. code-block:: console

  Host: comp01-01
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  1235678 testuser  20   0 7077748   1.2g 131936 R 100.0   0.6   3:46.31 nbody
  1235915 testuser  20   0 7077748   1.2g 132180 R 100.0   0.6   0:19.46 nbody
  1235961 testuser  20   0   50120   4508   3600 R   1.0   0.0   0:00.01 top
  1235622 testuser  20   0   15268   3608   3156 S   0.0   0.0   0:00.00 slurm_s+
  1235859 testuser  20   0   15268   3608   3156 S   0.0   0.0   0:00.00 slurm_s+
  1235960 testuser  20   0  141276   5404   3912 S   0.0   0.0   0:00.00 sshd
  Host: comp01-02
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  1229929 testuser  20   0   10.8g   1.5g 954020 R  99.0   0.8   0:05.46 nbody
  1229873 testuser  20   0   15268   3596   3140 S   0.0   0.0   0:00.00 slurm_s+
  1229944 testuser  20   0  141276   5524   4032 S   0.0   0.0   0:00.00 sshd
  1229945 testuser  20   0   50120   4476   3560 R   0.0   0.0   0:00.01 top

The output fields for processes are identical to those for the
standard linux ``top`` command executed in batch mode - see the man
page for an in-depth description of the meaning of each field. This
description will cover only the more relevant fields. Sets of
processes are grouped so that all of a user's processes on a compute
node appear together.

The first thing to note is that the information provided by ``qtop``
is very different from that of ``squeue``. ``qtop`` is not an
integrated part of SLURM so it will output process information from
each compute node with a running job, rather than job information - a
single job will involve executing a number of processes on a compute
node. You'll need to compare ``qtop`` and ``squeue`` output to work
out just what's going on. For example, ``qtop`` doesn't give you the
job-ID number, and it often lists two or more processes where
``squeue`` or ``qcgtop`` lists just one job.

The three most relevant fields in the output are labelled **COMMAND**,
**RES** and **CPU**.

The **COMMAND** field shows the name of the command being run by the
process. Because jobs are submitted to the cluster as a job the job
script itself becomes a process, which is named slurm_script,
shortened to **slurm_s+** in the above output.  The **slurm_script**
typically consumes very little CPU - it's simply setting up the job's
working environment and then calling the applications requested in the
job submission script.

As the ``qtop`` command runs the standard Linux ``top`` command on
each compute node, this command will also appear in the list along
with an ssh process (labelled **sshd**) which enables the remote
command.

For most purposes, you'll be interested in the remaining process(es)
listed - typically the main process that your job script is currently
running. In the above example the remaining processes are all called
**nbody** - one of the applications available on the HEC and the main
command in submitted job scripts.

The **RES** field gives the total *resident memory* size of each
process.  Smaller process sizes are listed in (k)ilobytes, larger ones
in (m)egabytes, or even (g)igabytes.

The other useful field in the qtop output is **CPU**, which describes
how much of a single CPU the process is consuming. Typically a running
serial job should be consuming very close to 100% of a CPU's
resources. In contrast, an MPI parallel job will show multiple
processes, each consuming around 100% CPU. OpenMP and other
multi-threaded processes will show a single process entry consuming
several hundred percent CPU - ideally 100x the number of cores being
used. Values considerably lower than these ideals will likely indicate
some problem; the process might be spending a disproportionate amount
of time performing file reads or writes; or in the case of badly
balanced parallel programs one process might be idle while waiting for
a communication from another process.

Note that the **PID** field gives the Linux process ID, not the SLURM
Job ID. Each process on a Linux system is assigned a unique process
ID, which forms part of the standard output for top.