Kronos Usage

TIP Kronos decommissioning:

Kronos was decommissioned on 2020-12-31.
See https://hpc.gsi.de/virgo for how to migrate to the new cluster Virgo.

Please read Reporting Issues before contacting the HPC group.

Use the alias kronos.hpc.gsi.de to log in to the interactive cluster nodes.

TIP Subscribe to our user mailing list HPC-INFO! (subscribe / sign off).



Before you start - If you are new to Slurm@Kronos

Apply for a Slurm account

Unlike on its predecessor Prometheus, a Slurm user account is needed to use Slurm@Kronos. You can apply for one by asking the account coordinator of your department/group.

If no such coordinator exists for your group, just open a Trouble Ticket in our ticket system.

We need your user name and the group (account) to which you belong (e.g. alice, radprot, hades, ...). This is usually your Linux account's primary group (gid), but it may differ; in any case it should be one of your secondary groups (check with id). If the Slurm account you want to join is neither the primary nor a secondary group of your Linux account, please open a ticket in the IT trouble ticket system (accounts-service@gsi.de) to get this fixed.
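You can check which Linux groups your account carries with the id command. A short illustrative sketch (the user jdow and the group names/IDs shown are only placeholders):

» id jdow
uid=3535(jdow) gid=1082(hpc) groups=1082(hpc),1083(alice)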

The accounting group is, as the name says, the account to which your consumed run time is charged, which in turn influences your job priority (together with other parameters).

If you want to become an account coordinator read below.

Introduction

Introduction to Slurm video series by the developers.

List of basic Slurm commands:

Command Description
sinfo Provides information on cluster partitions and nodes.
squeue Shows an overview of jobs and their states.
scontrol Views the Slurm configuration and state; also suspends/resumes jobs.
srun Runs an executable as a single job (and job step). Blocks until the job is scheduled.
salloc Submits an interactive job. Blocks until the job is scheduled and the prompt appears.
sbatch Submits a job script for batch scheduling. Returns immediately with the job ID.
scancel Cancels (or signals) a running or pending job.
sacct Displays data for all jobs and job steps in the accounting database.

Storage

The cluster has access to shared storage in the directory /lustre/ on all nodes.

Shared storage is built with the Lustre parallel distributed file system. Make sure to read → LustreFs for a more detailed description. Access permissions to the directories of the file system are granted by the coordinators of the experiment groups and departments. Users not associated with a particular user group should open a ticket in the IT trouble ticket system.

Software

Scientific software is available in the directory /cvmfs/. Each sub-directory contains a cached copy of a software repository maintained by a department or experiment group. Find a list of the available software on the page:

SoftwareInCvmfs

Software within /cvmfs/ can be loaded into the job environment using module:

CVMFS
EnvironmentModules
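As a quick sketch, software from /cvmfs/ can be made available in an interactive shell as follows; the repository and module name (it.gsi.de, openmpi/gcc/1.10.0) are taken from the Parallel Applications section below and will differ for other repositories:

» source /etc/profile.d/modules.sh
» module use /cvmfs/it.gsi.de/modulefiles/
» module avail
» module load openmpi/gcc/1.10.0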

Submit Your First Job

Before you submit your first job, apply for an account in Slurm. Access permission to the computing cluster is granted by the coordinators of the experiment groups and departments. Users not associated with a particular user group should contact the HPC department via the ticket system.

This document illustrates examples with a fictitious user called "jdow".
Follow the examples below, adjusting the user name and directory paths accordingly!

The sinfo command displays nodes and partitions.

Use the options -lNe with sinfo to view more information about the available resources:

» sinfo -lNe
Thu May 22 13:55:41 2014
NODELIST                  NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
lxb[1193-1194,1196-1197]      4     main*        idle   64    4:8:2 258461   801785      1 Xeon,E55 none

The example above shows the basic resource limits of the execution nodes: 4 sockets with 8 cores each, supporting hyper-threading (S:C:T), for a total of 64 available CPUs, plus the available memory and local storage. Be aware that these are NOT the free resources. The default partition main is marked with an asterisk.

Batch scripts are submitted to the cluster using the sbatch command.

The meta-command #SBATCH inside the job script marks Slurm configuration options:
  • Option --time=X-XX:XX:XX defines the run time limit for your job. If not set, the job will be killed after the default run time limit of the used partition (which is much lower than the max. run time limit). Setting a proper limit is important for the scheduling order.
    • Best practice is to run your workload a few times on the interactive part of the Kronos cluster to get an idea of how much time your jobs need, then set a safety margin of 50-100% on top of it and, in the end, submit your jobs with the needed run time limit to the destination partition.
  • Option -p name defines the partition on which your jobs should run. If no partition is specified, all jobs are executed on the main partition, even if the requested run time limit is higher than the one the main partition provides. In other words: Slurm will NOT select a proper partition for your jobs by itself.
  • Option -D path defines the working directory. In case of missing access privileges for the working directory, Slurm falls back to /tmp/.
  • Options -o path and -e path define files to store output from stdout and stderr respectively. Use %j (JOBID) and %N (name of the first node) to automatically adapt the file name to the job (by default stdout goes to stdout-%j.out). The target directories MUST exist at job start time; they will not be created by Slurm.

In this example the user jdow stores files in /lustre/nyx/hpc/jdow/.

#!/bin/bash

# Task name
#SBATCH -J test

# Run time limit
#SBATCH --time=4:00:00

# Working directory on shared storage
#SBATCH -D /lustre/nyx/hpc/jdow

# Standard and error output in different files
#SBATCH -o %j_%N.out.log
#SBATCH -e %j_%N.err.log

# Execute application code
hostname ; uptime ; sleep 30 ; uname -a

Save the script to a file called test.sh in a directory on the shared storage and submit it as a job to the cluster.

» sbatch test.sh
Submitted batch job 46

The system answers with the job identification number (JOBID) when the job has been accepted.

The command squeue prints the state of the scheduling queue.
» squeue 
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   46     debug     test     jdow  R       0:04      1 lxb1197
» cat 46_lxb1197.out.log
lxb1197
11:01:16 up 2 days, 18:25,  3 users,  load average: 0.00, 0.01, 0.05
Linux lxb1197 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux

The "ST" state column indicates the current state of your job. For the beginning two states are important to know PD pending, and R running. The first indicates that the job is waiting to be executed as soon as a suitable resources become available. Depending on the load of the cluster this may take a while. If your jobs disappears from the list it has been removed from the cluster, most likely because it finished.

Track Your First Problem

In case of a failure during job execution, it is important to distinguish between a problem internal to the application and an issue of the job execution environment. The following code demonstrates a wrapper script that collects environment information useful for later debugging. It executes a program segfaulter which breaks with an internal error:

#!/bin/sh

# Task name
#SBATCH -J breaker

# Working directory on shared storage
#SBATCH -D /hera/hpc/jdow

# Standard and error output in different files
#SBATCH -o breaker_%j_%N.out.log
#SBATCH -e breaker_%j_%N.err.log

# Function to print log messages
_log() {
  local format='+%Y/%m/%d-%H:%M:%S'
  echo [`date $format`] "$@"
}

_log Job $SLURM_JOB_ID \($SLURM_JOB_NAME\)
_log Submitted from $SLURM_SUBMIT_DIR
# Identity of the execution host
_log Running on $USER@`hostname`:$PWD \($SLURM_JOB_NODELIST\)

# A faulty program
/hera/hpc/jdow/segfaulter &
# The process ID of the last spawned child process
child=$!
_log Spawn segfaulter with PID $child
# Wait for the child to finish
wait $child
# Exit status of the child process
state=$?

_log Finishing with $state
# Propagate the exit status to the resource management system
exit $state

Helpful information includes the job identification number, the user account name, the execution host name and the submit directory. For most applications it will be necessary to check for more dependencies before starting the application program, for example the availability of input data and library dependencies. Furthermore the script logs the application process ID and the program exit status before it is propagated to the resource management system.

» sbatch breaker.sh 
Submitted batch job 267
» cat breaker_267_lxb1193.out.log 
[2014/05/22-16:09:04] Job 267 (breaker)
[2014/05/22-16:09:04] Submitted from /hera/hpc/jdow/
[2014/05/22-16:09:04] Running on jdow@lxb1193:/hera/hpc/jdow (lxb1193)
[2014/05/22-16:09:04] Spawn segfaulter with PID 52450
[2014/05/22-16:09:04] Finishing with 139

Above you can see the log of the failed segfaulter program. It includes the process ID on the execution host, as well as the exit status of the process (139 = 128 + 11, i.e. termination by signal 11, SIGSEGV).

The command sacct shows accounting data for finished jobs.

» sacct -j 267
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
267             breaker       main        hpc          1     FAILED    139:0 
267.batch         batch                   hpc          1     FAILED    139:0 

The ExitCode column presents the exit code (also known as exit status, return code, or completion code), followed by the signal that caused the process to terminate, if it was terminated by a signal. For sbatch jobs the captured exit code is the exit status of the batch script. For salloc jobs the exit code is the return value of the exit call that terminates the salloc session.
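To quickly find failed jobs in the accounting database, sacct can be restricted by start time and job state. A sketch with an illustrative date and one possible choice of output fields:

» sacct -S 2014-05-22 -s FAILED -o JobID,JobName,NodeList,ExitCode
[…]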

Cluster Resources

General information on the cluster status is available using the commands sinfo and scontrol. Customize the output of sinfo by selecting columns with option -o.

List resources and features of execution nodes with:

» sinfo -o "%4c %5z %8d %8m %25f %N"
CPUS S:C:T TMP_DISK MEMORY   FEATURES                  NODELIST
64   4:8:2 801785   258461   Xeon,E5520,Infiniband     lxb[1193-1194,1196-1197]

Show a list of draining (the node will not accept new jobs) and down (not available) nodes:

» sinfo -o '%10n %8T %20H %E' -t 'drain,down'
HOSTNAMES  STATE    TIMESTAMP            REASON
lxb1193    draining 2014-08-22T10:01:27  Update of Slurm to 2.6.5
lxb1194    draining 2014-08-22T10:01:27  Update of Slurm to 2.6.5
lxb1196    down     2014-08-22T10:00:43  Update of Slurm to 2.6.5
lxb1197    down     2014-08-22T10:00:43  Update of Slurm to 2.6.5

Display run time constraints and partition sizes:

» sinfo -o "%9P  %6g %10l %5w %5D %13C %N"
PARTITION  GROUPS TIMELIMIT  WEIGH NODES CPUS(A/I/O/T) NODELIST
main*      all    1-00:00:00 1     3     0/192/0/192   lxb[1193-1194,1196]
debug      all    1:00:00    1     1     0/64/0/64     lxb1197

The "TIMELIMIT" column use the format: day-hours:minutes:seconds.

Partitions

Slurm does not use the notion of queues like other resource management systems, but instead uses partitions, which serve a similar role. Partitions group nodes into logical (possibly overlapping) sets. They have an assortment of constraints like job size or time limit, access control, etc. Users need to understand the concept of partitions to allocate resources. By default sinfo lists partitions in the first column. The command scontrol shows more detailed information on partitions and execution nodes.

» scontrol show partition main
PartitionName=main
   AllocNodes=ALL AllowGroups=ALL Default=YES
   DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1
   Nodes=lxb119[3,4,6,7]
   Priority=1 RootOnly=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=256 TotalNodes=4 DefMemPerNode=UNLIMITED MaxMemPerCPU=4096

The default partition is marked with an asterisk, e.g. "main*"

» sinfo -o "%9P  %6g %10l %5w %5D %13C %N"
PARTITION  GROUPS TIMELIMIT  WEIGH NODES CPUS(A/I/O/T) NODELIST
main*      all    1-00:00:00 1     3     0/192/0/192   lxb[1193-1194,1196]
debug      all    1:00:00    1     1     0/64/0/64     lxb1197

Jobs can be sent to a specific partition using the option --partition with srun, salloc, and sbatch.
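For example, the test script from above could be submitted to the debug partition with a 30 minute run time limit (values are only illustrative):

» sbatch --partition=debug --time=00:30:00 test.sh
[…]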

CPU Allocation

Refer to the Slurm "CPU Management User Guide" for a more detailed description beyond the scope of the following section.

Terminology:

  • CPU - On multi-core systems a CPU refers to a core; the term itself carries no notion of sockets, cores, or threads.
  • Socket - A physical processor package, usually containing multiple cores.
  • Core - A single processing unit.
  • Thread - One or more execution contexts within a single core.
  • Affinity - The state of binding a process to a physical core.
  • Task - A logical group of resources required to execute a program (process).

By default a job consists of a single task allocating a single core.

The following table lists the options for the commands srun, salloc, and sbatch used to control the allocation of CPU resources:

Option Description
-n, --ntasks=<number> Number of tasks to start (default 1).
--ntasks-per-node=<ntasks> Number of tasks to invoke on each node (default 1).
--ntasks-per-core=<ntasks> Number of tasks to invoke on each core.
-c, --cpus-per-task=<ncpus> Number of CPUs per task (default 1). Useful for multi-threaded applications.
--threads-per-core=<threads> Number of threads per core to allocate (default 1).
-N, --nodes=<minnodes[-maxnodes]> Number of nodes to allocate.
-O, --overcommit Explicitly allow more than one process per CPU.
--exclusive Allocate nodes exclusively to the job.

In the following example the command srun is used to execute hostname. The option --ntasks is applied to gradually increase the number of tasks. Each task allocates a core, hence when the number of required tasks exceeds the capacity of a single node (here 32 cores), the job is spread across more nodes.

» srun hostname
lxb1193
» srun --ntasks 4 hostname
lxb1193
lxb1193
lxb1193
lxb1193
» srun --ntasks 32 hostname | sort | uniq -c
     32 lxb1193
» srun --ntasks 64 hostname | sort | uniq -c
     32 lxb1193
     32 lxb1194
» srun --ntasks 128 hostname | sort | uniq -c
     32 lxb1193
     32 lxb1194
     32 lxb1196
     32 lxb1197

Jobs can allocate individual sockets, cores and threads as consumable resources. The default allocation method across nodes is block allocation (allocate all available CPUs in a node before using another node). The default allocation method within a node is cyclic allocation (allocate available CPUs in a round-robin fashion across the sockets within a node). The option --ntasks-per-node enables users to distribute a specific number of tasks per node.

» srun --ntasks 4 --ntasks-per-node 2 hostname  
lxb1194
lxb1194
lxb1193
lxb1193
» srun --ntasks 4 --ntasks-per-node 1 hostname
lxb1194
lxb1197
lxb1193
lxb1196
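For a multi-threaded (e.g. OpenMP) application a single task with several CPUs is usually the appropriate allocation. A minimal sketch of such a batch script, assuming a hypothetical executable my_openmp_app in the working directory:

#!/bin/bash
#SBATCH -J omp-test
#SBATCH -D /lustre/nyx/hpc/jdow
#SBATCH -o %j_%N.out.log
#SBATCH -e %j_%N.err.log
# A single task with 8 CPUs on one node
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Tell the application how many threads it may use
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./my_openmp_app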

Generic Resources & Features

Nodes can have associated features to indicate node characteristics, as well as Generic Resources (GRES) representing specific hardware devices (e.g. GPGPUs). List resources and features of execution nodes with:
» sinfo -o "%4c %10z %8d %8m %10f %10G %D"
CPUS S:C:T      TMP_DISK MEMORY   FEATURES   GRES       NODES
40   2:10:2     550000   128000   (null)     (null)     150
40   2:10:2     550000   256000   hawaii     gpu:4      131
40   2:10:2     550000   256000   tahiti     gpu:4      9

Users can allocate GRES with option --gres=<list>, and features with --constraint=<list>:
[…] --gres=gpu:1 --constraint=tahiti […]
[…] --gres=gpu:2*cpu […]

GRES are specified with a string following the pattern name[:count[*cpu]] (see the example sketch after this list):

  • name – Identifier name of the consumable resource.
  • count – Number of resources (default 1).
  • *cpu – Allocate the specified resource per CPU (instead of per job on each node).
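A minimal sketch of a batch script requesting one GPU on a node with the tahiti feature, following the sinfo output above; the executable gpu_app is only a placeholder:

#!/bin/bash
#SBATCH -D /lustre/nyx/hpc/jdow
#SBATCH -o %j_%N.out.log
#SBATCH -e %j_%N.err.log
# Request one GPU and restrict the job to nodes with the tahiti feature
#SBATCH --gres=gpu:1
#SBATCH --constraint=tahiti

./gpu_app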

Array Jobs

Job arrays are only supported for batch jobs.

Array index values are specified using the option -a (or --array) for the sbatch command.

»  sbatch --array=1-5 […]
Submitted batch job 23
»  squeue 
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 23_1      main     test     jdow  R       0:04      1 lxb001
 23_2      main     test     jdow  R       0:04      1 lxb001
 23_3      main     test     jdow  R       0:04      1 lxb001
 23_4      main     test     jdow  R       0:04      1 lxb001
 23_5      main     test     jdow  R       0:04      1 lxb001
» sbatch -a 2,4,6,8 […]
[…]
» sbatch -a 0-8:2 […]
Submitted batch job 32
» squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 32_0      main     test     jdow  R       0:01      1 lxb001
 32_2      main     test     jdow  R       0:01      1 lxb001
 32_4      main     test     jdow  R       0:01      1 lxb001
 32_6      main     test     jdow  R       0:01      1 lxb001
 32_8      main     test     jdow  R       0:01      1 lxb001

Each array task has the environment variable SLURM_ARRAY_TASK_ID set to its array index value. Note that all tasks of an array job share a common SLURM_ARRAY_JOB_ID, while having an individual SLURM_JOBID. Commands like squeue show the array job ID followed by the index number with an underscore as delimiter, e.g. "23_5" above. Use the markers %A (array job ID) and %a (array index) to format output/error file names with option -o:

» sbatch -a 1-3 -o slurm-%A_%a.out.log […]
[…]
» scancel 23_1 23_4
[…]
» scancel 32_[2-6]
[…]

Use the array job ID to cancel all tasks with scancel, or append the array indexes for specific tasks as demonstrated above.
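Inside the job script the array index is typically used to select per-task input. A small sketch, assuming hypothetical input files input_1.dat … input_5.dat and a placeholder program process_data in the working directory:

#!/bin/bash
#SBATCH -J array-test
#SBATCH -D /lustre/nyx/hpc/jdow
#SBATCH -o slurm-%A_%a.out.log
#SBATCH -e slurm-%A_%a.err.log
#SBATCH --array=1-5

# Each task processes a different input file selected by its array index
./process_data input_${SLURM_ARRAY_TASK_ID}.dat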

Pending array tasks are combined into a single line by the squeue command; use option -r to expand the list.

» squeue
      JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
47_[24-100]      main     test     jdow PD       0:00      1 (Resources)
       47_1      main     test     jdow  R       0:08      1 lxb001
       47_2      main     test     jdow  R       0:08      1 lxb001
       47_3      main     test     jdow  R       0:08      1 lxb001
       47_4      main     test     jdow  R       0:08      1 lxb001
[…]

Parallel Applications

You may want to load an appropriate version of OpenMPI from CVMFS using EnvironmentModules, e.g.:
>>> source /etc/profile.d/modules.sh
>>> module use /cvmfs/it.gsi.de/modulefiles/
>>> module load openmpi/gcc/1.10.0
>>> which mpicc mpirun
/cvmfs/it.gsi.de/openmpi/gcc/1.10.0/bin/mpicc
/cvmfs/it.gsi.de/openmpi/gcc/1.10.0/bin/mpirun

Compile a simple MPI program hello_world.c with mpicc and execute it with mpirun:

>>> mpicc -o hello_world hello_world.c
>>> mpirun -np 4 hello_world 

Interactive

Execute the same program on the cluster by allocating resources with salloc:

>>> salloc -p debug -N2 -n40 bash
salloc: Granted job allocation 162
$ mpirun hello_world
Hello world lxb1193.2918 [15/40]
Hello world lxb1194.64133 [35/40]
Hello world lxb1193.2905 [7/40]
Hello world lxb1193.2908 [10/40]
Hello world lxb1193.2898 [0/40]
Hello world lxb1194.64151 [52/40]
Hello world lxb1193.2919 [16/40]
Hello world lxb1194.64150 [51/40]
Hello world lxb1193.2934 [22/40]
Hello world lxb1193.2907 [9/40]
[…]
$ exit
salloc: Relinquishing job allocation 162

Note that you cannot use the srun command; it is not usable with MPI programs!

Batch

Simple batch script used to start an MPI application:

#!/bin/bash

#SBATCH -D /lustre/nyx/hpc/jdow
#SBATCH -o %j_%N.out.log
#SBATCH -e %j_%N.err.log

# Resource requirements for parallel execution
#SBATCH -N 2
#SBATCH -n 40
#SBATCH -p debug

# Load the required Open MPI environment
source /etc/profile.d/modules.sh
module use /cvmfs/it.gsi.de/modulefiles/
module load openmpi/gcc/1.10.0

# Execute the application
mpirun hello_world

Job Management

Common Job Options

Slurm commands print a summary of the supported options with option --usage:

» sinfo --usage
Usage: sinfo [-abdelNRrsv] [-i seconds] [-t states] [-p partition] [-n nodes]
             [-S fields] [-o format] 

Job meta data, output streams, mail hooks, etc. (a combined example follows after the option tables):

Option Description
-o, --output Specify a file to store all normal output (stdout); use %j (job ID) and %N (name of the first node) to automatically adapt the file name to the job, by default stdout goes to "stdout-%j.out"
-e, --error Specify a file to store all error output (stderr). (See above.)
--job-name Name of the job (24 characters max.)
--comment Attach a comment to the job
--mail-type Events triggering a mail (BEGIN, END, FAIL)
--mail-user Recipient mail address (a fully qualified address is required, e.g. j.doe@gsi.de)
--begin Delay the start of a job, e.g. 16:00, now+1hour, or 2014-01-02T13:15:00
Resource allocation options:

Option Description
-M, --clusters Cluster name (one or many)
-p, --partition Partition name (one or many)
--mem Memory per node, in MB
--mem-per-cpu Memory per CPU, in MB
--gres Generic resources (e.g. GPUs)
--licenses Software licenses
Options to set restrictions and/or constraints:

Option Description
--constraint Request a node feature (e.g. Xeon)
--mincpus Minimum number of CPUs per node
--tmp Minimum amount of temporary disk space per node
-d, --dependency=after(ok|notok|any):jobid Specify job ordering; the job starts only after the referenced job has finished (in the given state)
--reservation Allocate resources from a named reservation
--share Allow the allocation to share nodes with other jobs
--contiguous Require a contiguous set of nodes
--geometry Specify the task distribution geometry (on supported systems)
-F, --nodefile Request the nodes listed in a file
-w, --nodelist Restrict the job to a set of nodes (comma separated list)
-x, --exclude Exclude nodes from the job (comma separated list)
--switches Maximum number of leaf switches for the allocation (optionally with a maximum waiting time)
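A sketch of a job script header combining several of the options above; the mail address, memory, time limit and feature name are only illustrative:

#!/bin/bash
#SBATCH --job-name=analysis
#SBATCH --partition=main
#SBATCH --time=2:00:00
#SBATCH --mem-per-cpu=2048
#SBATCH --constraint=Xeon
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=j.doe@gsi.de
#SBATCH -o %j_%N.out.log
#SBATCH -e %j_%N.err.log

# Application code follows here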

Requeue

Jobs get automatically requeued if compute nodes fail during execution.

It is possible to alter this behavior using the following options at job submission:

Option Description
--requeue Default; automatically requeue the job after a node failure.
--no-requeue Prevent the job from being requeued.
The requeue configuration flag (1 = true) defines this behavior for each job:

» scontrol show job 2415 | grep Requeue
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
» scontrol update jobid=2415 requeue=0

Set requeue to zero to disable this behavior for a running job.

Monitoring Jobs

Watch the scheduling queue with the squeue command.

By default the output is limited to jobs belonging to your user account. Typically the output contains the job identifier, which is used with other commands to interact with a running job.

» squeue
JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  139      main   stress     jdow  PD       0:00      1 (Resources)
  140      main   stress     jdow  PD       0:00      1 (Priority)
  138      main   stress     jdow   R       0:03      2 lxb[1194,1196]
  137      main   stress     jdow   R       0:11      2 lxb[1193-1194]
  135      main   stress     jdow   R       0:12      1 lxb1196
  136      main   stress     jdow   R       0:12      1 lxb1196
  134      main   stress     jdow   R       0:13      1 lxb1194
  132      main   stress     jdow   R       0:14      1 lxb1193
  133      main   stress     jdow   R       0:14      1 lxb1194
  131      main   stress     jdow   R       0:22      1 lxb1193

Option -t state limits the list to jobs in a certain state. The following job states (ST) appear in the output of squeue:

State Code Meaning
PENDING PD Job is awaiting resource allocation.
RUNNING R Job currently has an allocation.
SUSPENDED S Job has an allocation, but execution has been suspended.
COMPLETING CG Job is in the process of completing. Some processes on some nodes may still be active.
COMPLETED CD Job has terminated all processes on all nodes.
CONFIGURING CF Job has been allocated resources, but is waiting for them to become ready for use.
CANCELLED CA Job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated.
FAILED F Job terminated with a non-zero exit code or another failure condition.
TIMEOUT TO Job terminated upon reaching its time limit.
PREEMPTED PR Job has been suspended by a higher-priority job on the same resource.
NODE_FAIL NF Job terminated due to failure of one or more allocated nodes.

Detailed information about job parameters is displayed by the scontrol command.

» scontrol show job 145
JobId=145 Name=stress
   UserId=jdow(3535) GroupId=hpc(1082)
   Priority=2 Account=hpc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:02:02 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2014-03-26T10:26:29 EligibleTime=2014-03-26T10:26:29
   StartTime=2014-03-26T10:26:29 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=main AllocNode:Sid=lxdv111:10136
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=lxb1193
   BatchHost=lxb1193
   NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/hera/hpc/jdow/tests/stress.sh 600s 12 500M
   WorkDir=/hera/hpc/jdow/tests

Why does my job not start?

Resources

This just means that your job has the necessary priority, but all resources requested by your job are currently allocated elsewhere.

Priority

All other jobs in the queue before your job have a higher priority.

Dependency

The job is waiting for another job to finish.

scontrol show job JobID | grep JobState

JobState=PENDING Reason=Dependency Dependency=afterany:anotherJobID
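Such dependencies are created at submission time with the --dependency option. A small sketch chaining two jobs, using sbatch --parsable to capture the first job ID (the script names are placeholders):

» first=$(sbatch --parsable preprocess.sh)
» sbatch --dependency=afterok:${first} analysis.sh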

BeginTime

A start time in the future has been set for the job (e.g. with option --begin); it will not start before that time.

PartitionTimeLimit

The selected partition cannot fulfill the requested run time limit. Please select another partition for the job:

scontrol update job=JobID partition=....

JobHeldUser

Either you, your account coordinator, one of the admins, or the system has held your job.

In the latter case this means that the job had already been scheduled to the cluster but ended with a problem and was requeued.

scontrol show job JobID | grep Restarts

Requeue=1 Restarts=1 BatchFlag=2 ExitCode=0:0

A JobHeldUser flag can be released by the associated user with scontrol release JobID.

JobHeldAdmin

An administrator has held your job, and it can be released only by an admin. Usually you will get a friendly mail after such an action.

For a full list of reasons please have a look at http://slurm.schedmd.com/squeue.html#lbAF

(please keep in mind that there could be differences, as this documentation refers to the latest version of Slurm)

Priorities & Shares

Several factors contribute to the calculation of job priorities, among them job size, partition, and fair-share. The fair-share factor is calculated from a defined share for each group and for the users inside each group (so even with little past usage of your own, your jobs may not get a high priority if your group's share is used up). The user shares are shares within their corresponding group. Fair-share considers historical use of the cluster resources to achieve a long-term balancing of resource shares. Historical accounting information has a decay half-life (currently 7 days) to reduce the long-term effect of past resource usage.

Display the priority calculation configuration of the cluster controller with scontrol.

» scontrol show config | grep ^Priority
PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = 0
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 0
[…]

List pending jobs sorted by priority with squeue:

» squeue -o '%.7i %.9Q %.9P %.8j %.8u %.8T %.10M %.11l %.8D %.5C %R' -S '-p' --state=pending
[…]

Show the priority given to a job with squeue:

» squeue -o %Q -j JOBID
[…]
» sprio -w
[…]

The sshare command lists the shares of associations.

» sshare 
             Account       User Raw Shares Norm Shares   Raw Usage Effectv Usage  FairShare 
-------------------- ---------- ---------- ----------- ----------- ------------- ---------- 
root                                          1.000000      288990      1.000000   0.500000 
alice                                   30    0.297030           0      0.000000   1.000000 
cbm                                     20    0.198020           0      0.000000   1.000000 
hades                                   20    0.198020           0      0.000000   1.000000 
hpc                                     10    0.099010      288990      1.000000   0.000911 
  hpc                      jdow     parent    0.099010      288990      1.000000   0.000911 
panda                                   20    0.198020           0      0.000000   1.000000 

You can lower the priority of your own jobs (as an account coordinator also of jobs from the same group) if you have important and not-so-important jobs at the same time:

» scontrol update job=JobID nice=yyy

Accounting

The command sacct reports resource usage for running or terminated jobs, including individual tasks, which can be useful to detect load imbalance between the tasks. View a summary with option -b:

» sacct -b | tail -20
28            COMPLETED      0:0 
29               FAILED    127:0 
30               FAILED    127:0 
31            COMPLETED      0:0 
32            COMPLETED      0:0 
33            COMPLETED      0:0 
34            COMPLETED      0:0 
35            COMPLETED      0:0 
36            COMPLETED      0:0 
37           CANCELLED+      0:0 
38               FAILED    127:0 
39               FAILED    127:0 
40            COMPLETED      0:0 
41            COMPLETED      0:0 
42            COMPLETED      0:0 
43            COMPLETED      0:0 
44               FAILED    130:0 
45            COMPLETED      0:0 
46            COMPLETED      0:0 
46.batch      COMPLETED      0:0 

The output is customizable with option -o (list available fields with -e).

» sacct -j 46.batch -o 'JobID,NodeList,NCPUS,CPUTime,MaxRSS,' 
       JobID        NodeList      NCPUS    CPUTime     MaxRSS 
------------ --------------- ---------- ---------- ---------- 
46.batch             lxb1197          1   00:00:30      1964K 
» sacct --format "JobID%3,User%10,CPUTime%8,NodeList" 
Job       User  CPUTime        NodeList 
--- ---------- -------- --------------- 
  2       jdow 00:00:00          lxb007 
  3       jdow 00:00:00    lxb[001-004] 
  4       jdow 00:00:04    lxb[001-004] 
  5       jdow 00:00:40    lxb[001-004] 
  6       jdow 00:00:15    lxb[001-003]

Account Coordinators

What's an Account Coordinator?

Account coordinators organize the cluster usage for a specific group (department or experiment).

The account coordinator can:
  • create Kronos accounts for users in your group
  • distribute group shares over the users
  • modify/suspend/delete jobs of all users in your group

There can be more than one account coordinator per group, and this is not necessarily a lifetime job.

Becoming an Account Coordinator for your group

If you think you are the ideal person for being the account coordinator for your group/department/experiment, you can ask us and we will have a look.

If you are already an Account Coordinator

Usually there are two things to do in the beginning, e.g.:
  • sacctmgr add user alice account=atWonderland
    • where 'alice' is the Linux user name and 'atWonderland' is the Kronos group the user should belong/account to (usually equal to your own group)
  • sacctmgr modify user alice where account=atWonderland set GrpCPUs=1024 GrpJobs=1000 GrpSubmit=5000
    • and some changes to restrict the user a little bit: max. 1024 CPUs, max. 1000 running jobs, max. 5000 jobs in all states together (R+CG+PD)
    • these values are our suggested defaults - please raise them only if you have had a look at the user's code/jobs/productions
Topic attachments
  • hello_world.c - Simple MPI program (VictorPenso, 2014-05-16)
  • segfaulter - Script to create a broken executable program called segfaulter (VictorPenso, 2014-05-22)