Kronos Usage
Kronos decommissioning: Kronos has been decommissioned on 2020-12-31. See https://hpc.gsi.de/virgo for how to migrate to the new cluster Virgo.
Please read Reporting Issues before contacting the HPC group.
Use the alias kronos.hpc.gsi.de to log in to the interactive cluster nodes.
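For example, log in via SSH (jdow is the fictitious user used throughout this document):
» ssh jdow@kronos.hpc.gsi.de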
Before you start - If you are new to Slurm@Kronos
Apply for a Slurm account
Unlike on its predecessor Prometheus, a Slurm user account is needed to use Slurm@Kronos. You can apply for one by asking the account coordinator of your department/group. If no such coordinator exists for your group, just open a trouble ticket in our ticket system. We need your user name and the group (account) to which you belong (e.g. alice, radprot, hades, ...).
This should be your Linux account's primary group (gid), but it may differ. Nevertheless, it should be one of your secondary groups (check with id).
If the Slurm account you want to join is neither the primary nor a secondary group of your Linux account, please open a ticket in the IT trouble ticket system (accounts-service@gsi.de) to get this fixed.
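To check which groups your Linux account belongs to, use id; the output below is illustrative, reusing the fictitious user jdow:
» id
uid=3535(jdow) gid=1082(hpc) groups=1082(hpc),1084(hades)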
The accounting group is, as the name says, the account to which your used run time is charged, which in turn influences your job priority (together with other parameters).
If you want to become an account coordinator, read below.
Introduction
Introduction to Slurm video series by the developers.
List of basic Slurm commands:
| Command | Description |
| sinfo | Provides information on cluster partitions and nodes. |
| squeue | Shows an overview of jobs and their states. |
| scontrol | View Slurm configuration and state; also used to suspend/resume jobs. |
| srun | Run an executable as a single job (and job step). Blocks until the job is scheduled. |
| salloc | Submits an interactive job. Blocks until the job is scheduled and the prompt appears. |
| sbatch | Submits a job script for batch scheduling. Returns immediately with the job ID. |
| scancel | Cancels (or signals) a running or pending job. |
| sacct | Displays accounting data for jobs and job steps in the accounting database. |
Storage
The cluster has access to shared storage in the directory /lustre/ on all nodes. Shared storage is built with the Lustre parallel distributed file system. Make sure to read → LustreFs for a more detailed description.
Access permissions to the file system's directories are granted by the coordinators of the experiment groups and departments. Users not associated with a particular user group should open a ticket in the IT trouble ticket system.
Software
Scientific software is available in a directory called /cvmfs/. Each sub-directory contains a cached copy of a software repository maintained by a department or experiment group. Find a list of available software at the page:
→ SoftwareInCvmfs
Software within /cvmfs/ can be loaded into the job environment using module:
→ CVMFS
→ EnvironmentModules
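A minimal sketch of loading software from CVMFS into the shell environment; the Open MPI module shown here is taken from the MPI example further below, and other repositories provide their own modulefiles:
» source /etc/profile.d/modules.sh
» module use /cvmfs/it.gsi.de/modulefiles/
» module avail
» module load openmpi/gcc/1.10.0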
Submit Your First Job
Before you submit your first job, apply for an account in Slurm.
Access permission to the computing cluster is granted by the coordinators of the experiment groups and departments. Users not associated with a particular user group should contact the HPC department via the ticket system.
This document illustrates examples with a fictitious user called "jdow". Follow the examples below, adjusting user name and directory paths accordingly!
The sinfo command displays nodes and partitions. Use the options -lNe with sinfo to view more information about the available resources:
» sinfo -lNe
Thu May 22 13:55:41 2014
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
lxb[1193-1194,1196-1197] 4 main* idle 64 4:8:2 258461 801785 1 Xeon,E55 none
The example above shows the basic resource limits of the execution nodes: 4 sockets with 8 cores each, supporting hyper-threading (S:C:T), for a total of 64 available CPUs, as well as the available memory and local storage. Be aware that these are NOT the free resources. The default partition main* is marked with an asterisk.
Batch scripts are submitted to the cluster using the sbatch command. The meta-command #SBATCH inside the job script marks Slurm configuration options:
- Option --time=X-XX:XX:XX defines the run time limit for your job. If not set, the job will be killed after the default run time limit of the used partition (which is much lower than the maximum run time limit). Setting a proper limit is important for the scheduling order.
- Best practice is to run your workload a few times on the interactive part of the Kronos cluster to get an idea of how much time your jobs need, then set a safety margin of 50-100% on top of it and, finally, submit your jobs with the needed run time limit to the destination partition.
- Option -p name defines the partition on which your jobs should run. If no partition name is defined, all jobs will be executed on the main partition, even if the requested run time limit is higher than the one the main partition provides. In other words: Slurm will NOT select a proper partition for your jobs by itself.
- Option -D path defines the working directory. In case of missing access privileges for the working directory, Slurm falls back to /tmp/.
- Options -o path and -e path define files to store output from stdout and stderr respectively. Use %j (job ID) and %N (name of first node) to automatically adapt the file name to a job (by default stdout goes to stdout-%j.out). These directories MUST exist at job start time; they will not be created by Slurm.
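Since Slurm does not create output directories, create them before submitting. A hedged example, assuming job logs should go to a (hypothetical) log/ sub-directory referenced as #SBATCH -o log/%j_%N.out.log:
» mkdir -p /lustre/nyx/hpc/jdow/log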
In this example the user jdow stores files in /lustre/nyx/hpc/jdow/.
#!/bin/bash
# Task name
#SBATCH -J test
# Run time limit
#SBATCH --time=4:00:00
# Working directory on shared storage
#SBATCH -D /lustre/nyx/hpc/jdow
# Standard and error output in different files
#SBATCH -o %j_%N.out.log
#SBATCH -e %j_%N.err.log
# Execute application code
hostname ; uptime ; sleep 30 ; uname -a
Save the script in a file called test.sh in a directory on the shared storage and submit it as a job to the cluster:
» sbatch test.sh
Submitted batch job 46
The system answers with the job identification number (JOBID) when the job has been accepted.
The command squeue prints the state of the scheduling queue.
» squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
46 debug test jdow R 0:04 1 lxb1197
» cat 46_lxb1197.out.log
lxb1197
11:01:16 up 2 days, 18:25, 3 users, load average: 0.00, 0.01, 0.05
Linux lxb1197 3.2.0-4-amd64 #1 SMP Debian 3.2.54-2 x86_64 GNU/Linux
The "ST" state column indicates the current state of your job. For the beginning two states are important to know
PD
pending, and
R
running. The first indicates that the job is waiting to be executed as soon as a suitable resources become available. Depending on the load of the cluster this may take a while. If your jobs disappears from the list it has been removed from the cluster, most likely because it finished.
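To narrow the listing, squeue accepts filters; for example, show only your own pending and running jobs:
» squeue -u jdow -t PD,R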
Track Your First Problem
In case of a failure during job execution, it is important to distinguish between a problem internal to the application and an issue with the job execution environment. The following code demonstrates a wrapper script that collects environment information useful for later debugging. It executes a program segfaulter, which breaks with an internal error:
#!/bin/sh
# Task name
#SBATCH -J breaker
# Working directory on shared storage
#SBATCH -D /hera/hpc/jdow
# Standard and error output in different files
#SBATCH -o breaker_%j_%N.out.log
#SBATCH -e breaker_%j_%N.err.log
# Function to print log messages
_log() {
local format='+%Y/%m/%d-%H:%M:%S'
echo [`date $format`] "$@"
}
_log Job $SLURM_JOB_ID \($SLURM_JOB_NAME\)
_log Submitted from $SLURM_SUBMIT_DIR
# Identity of the execution host
_log Running on $USER@`hostname`:$PWD \($SLURM_JOB_NODELIST\)
# A faulty program
/hera/hpc/jdow/segfaulter &
# The process ID of the last spawned child process
child=$!
_log Spawn segfaulter with PID $child
# Wait for the child to finish
wait $child
# Exit status of the child process
state=$?
_log Finishing with $state
# Propagate the exit status to the resource management system
exit $state
Helpful information includes the job identification number, the user account name, the execution host name, and the submit directory, all printed by the _log statements at the top of the script. For most applications it will be necessary to check for more dependencies before starting the application program, for example the availability of input data and library dependencies. Furthermore, the script logs the application's process ID and the program's exit status before the latter is propagated to the resource management system.
» sbatch breaker.sh
Submitted batch job 267
» cat breaker_267_lxb1193.out.log
[2014/05/22-16:09:04] Job 267 (breaker)
[2014/05/22-16:09:04] Submitted from /hera/hpc/jdow/
[2014/05/22-16:09:04] Running on jdow@lxb1193:/hera/hpc/jdow (lxb1193)
[2014/05/22-16:09:04] Spawn segfaulter with PID 52450
[2014/05/22-16:09:04] Finishing with 139
Above you can see the log of the failed segfaulter program. It includes the process ID on the host, as well as the exit status of the last process.
The command sacct shows accounting data for finished jobs.
» sacct -j 267
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
267 breaker main hpc 1 FAILED 139:0
267.batch batch hpc 1 FAILED 139:0
The ExitCode column presents the exit code (also known as exit status, return code, or completion code), followed by the signal that caused the process to terminate, if it was terminated by a signal. For sbatch jobs, the exit code captured is the exit code of the batch script. For salloc jobs, the exit code is the return value of the exit call that terminates the salloc session.
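The exit code 139 seen above encodes a signal: the shell reports 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV). This can be verified with the shell builtin kill:
» kill -l $((139 - 128))
SEGV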
Cluster Resources
General information on the cluster status is available using the commands sinfo and scontrol. Customize the output of sinfo by selecting columns with option -o.
List resources and features of execution nodes with:
» sinfo -o "%4c %5z %8d %8m %25f %N"
CPUS S:C:T TMP_DISK MEMORY FEATURES NODELIST
64 4:8:2 801785 258461 Xeon,E5520,Infiniband lxb[1193-1194,1196-1197]
Show a list of draining (the node will not accept new jobs) and down (not available) nodes:
» sinfo -o '%10n %8T %20H %E' -t 'drain,down'
HOSTNAMES STATE TIMESTAMP REASON
lxb1193 draining 2014-08-22T10:01:27 Update of Slurm to 2.6.5
lxb1194 draining 2014-08-22T10:01:27 Update of Slurm to 2.6.5
lxb1196 down 2014-08-22T10:00:43 Update of Slurm to 2.6.5
lxb1197 down 2014-08-22T10:00:43 Update of Slurm to 2.6.5
Display run-time constraints and partition sizes:
» sinfo -o "%9P %6g %10l %5w %5D %13C %N"
PARTITION GROUPS TIMELIMIT WEIGH NODES CPUS(A/I/O/T) NODELIST
main* all 1-00:00:00 1 3 0/192/0/192 lxb[1193-1194,1196]
debug all 1:00:00 1 1 0/64/0/64 lxb1197
The "TIMELIMIT" column use the format:
day-hours:minutes:seconds
.
Partitions
Slurm does not use the notion of queues like other resource management systems, but instead uses partitions, which serve a similar role. Partitions group nodes into logical (possibly overlapping) sets. They have an assortment of constraints like job size or time limit, access control, etc. Users need to understand the concept of partitions in order to allocate resources. By default sinfo lists partitions in the first column. The command scontrol shows more detailed information on partitions and execution nodes.
» scontrol show partition main
PartitionName=main
AllocNodes=ALL AllowGroups=ALL Default=YES
DefaultTime=01:00:00 DisableRootJobs=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=1
Nodes=lxb119[3,4,6,7]
Priority=1 RootOnly=NO Shared=NO PreemptMode=OFF
State=UP TotalCPUs=256 TotalNodes=4 DefMemPerNode=UNLIMITED MaxMemPerCPU=4096
The default partition is marked with an asterisk, e.g. "main*":
» sinfo -o "%9P %6g %10l %5w %5D %13C %N"
PARTITION GROUPS TIMELIMIT WEIGH NODES CPUS(A/I/O/T) NODELIST
main* all 1-00:00:00 1 3 0/192/0/192 lxb[1193-1194,1196]
debug all 1:00:00 1 1 0/64/0/64 lxb1197
Jobs can be sent to a specific partition using the option --partition with srun, salloc, and sbatch.
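For example, to send the test job from above to the debug partition:
» sbatch --partition debug test.sh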
CPU Allocation
Refer to the Slurm "CPU Management User Guide" for a detailed description beyond the scope of the following section.
Terminology:
- CPU - On multi-core systems, a CPU is a core; at this level there is no notion of sockets, cores, or threads.
- Socket - A physical group of processors, usually containing multiple cores.
- Core - A single processor unit.
- Thread - One or more execution contexts within a single core.
- Affinity - The state of binding a process to a physical core.
- Task - A logical group of resources required to execute a program (process).
By default a job consists of a single task allocating a single core.
The following table lists the options for the commands srun, salloc, and sbatch used to control the allocation of CPU resources:
| Option | Description |
| -n, --ntasks=<number> | Number of tasks to start. (default=1) |
| --ntasks-per-node=<ntasks> | Number of tasks to invoke on each node. (default=1) |
| --ntasks-per-core=<ntasks> | Number of tasks to invoke on each core. |
| -c, --cpus-per-task=<ncpus> | Number of CPUs per process (default=1). Useful for multi-threaded applications. |
| --threads-per-core=<threads> | Number of threads per core to allocate. (default=1) |
| -N, --nodes=<minnodes[-maxnodes]> | Number of nodes to allocate. |
| -O, --overcommit | Explicitly allow more than one process per CPU. |
| --exclusive | Allocate nodes exclusively to a job. |
In the following example the command srun is used to execute hostname. The option --ntasks is applied to gradually increase the number of tasks. Each task allocates a core; hence, when the number of required tasks exceeds the capacity of a single node (here 32 cores), the job is spread across more nodes.
» srun hostname
lxb1193
» srun --ntasks 4 hostname
lxb1193
lxb1193
lxb1193
lxb1193
» srun --ntasks 32 hostname | sort | uniq -c
32 lxb1193
» srun --ntasks 64 hostname | sort | uniq -c
32 lxb1193
32 lxb1194
» srun --ntasks 128 hostname | sort | uniq -c
32 lxb1193
32 lxb1194
32 lxb1196
32 lxb1197
Jobs can allocate individual sockets, cores and threads as consumable resources. The default allocation method across nodes is block allocation (allocate all available CPUs in a node before using another node). The default allocation method within a node is cyclic allocation (allocate available CPUs in a round-robin fashion across the sockets within a node). The option --ntasks-per-node enables users to distribute a specific number of tasks to each node.
» srun --ntasks 4 --ntasks-per-node 2 hostname
lxb1194
lxb1194
lxb1193
lxb1193
» srun --ntasks 4 --ntasks-per-node 1 hostname
lxb1194
lxb1197
lxb1193
lxb1196
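For multi-threaded applications, allocate several CPUs to a single task with --cpus-per-task instead of starting more tasks. A minimal sketch, assuming a hypothetical OpenMP program my_threaded_app:
#!/bin/bash
#SBATCH -J threaded
#SBATCH -n 1
#SBATCH -c 8
# Slurm exports the allocated CPU count; pass it to the application
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threaded_app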
Generic Resources & Features
Nodes can have associated features indicating node characteristics, as well as generic resources (GRES) representing specific hardware devices (e.g. GPGPUs). List resources and features of execution nodes with:
» sinfo -o "%4c %10z %8d %8m %10f %10G %D"
CPUS S:C:T TMP_DISK MEMORY FEATURES GRES NODES
40 2:10:2 550000 128000 (null) (null) 150
40 2:10:2 550000 256000 hawaii gpu:4 131
40 2:10:2 550000 256000 tahiti gpu:4 9
Users can allocate GRES with option --gres=<list>, and features with --constraint=<list>:
[…] --gres=gpu:1 --constraint=tahiti […]
[…] --gres=gpu:2*cpu […]
GRES are defined with a string following the pattern name[:count[*cpu]]:
- name – identifier of the consumable resource
- count – number of resources (default 1)
- *cpu – allocate the specified resources per CPU instead of per node for the job
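Putting both together, a hedged example that requests one GPU on a node with the tahiti feature (gpu_app stands in for any GPU program):
» srun --gres=gpu:1 --constraint=tahiti ./gpu_app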
Array Jobs
Job arrays are only supported for batch jobs. Array index values are specified using the option -a (--array) of the sbatch command.
» sbatch --array=1-5 […]
Submitted batch job 23
» squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
23_1 main test jdow R 0:04 1 lxb001
23_2 main test jdow R 0:04 1 lxb001
23_3 main test jdow R 0:04 1 lxb001
23_4 main test jdow R 0:04 1 lxb001
23_5 main test jdow R 0:04 1 lxb001
» sbatch --array=2,4,6,8 […]
[…]
» sbatch --array=0-8:2 […]
Submitted batch job 32
» squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
32_0 main test jdow R 0:01 1 lxb001
32_2 main test jdow R 0:01 1 lxb001
32_4 main test jdow R 0:01 1 lxb001
32_6 main test jdow R 0:01 1 lxb001
32_8 main test jdow R 0:01 1 lxb001
Each array task will have the environment variable SLURM_ARRAY_TASK_ID set to its array index value. Note that all tasks of an array job share a common SLURM_ARRAY_JOB_ID, while having an individual SLURM_JOBID. Commands like squeue show the array job ID followed by its index number with an underscore as delimiter, e.g. "23_5" above. Use the markers %A_%a to format output/error file names with option -o:
» sbatch --array=1-3 -o slurm-%A_%a.out.log […]
[…]
» scancel 23_1 23_4
[…]
» scancel 32_[2-6]
[…]
Use the array job ID to cancel all tasks with scancel, or append the array indexes to cancel specific tasks, as demonstrated above.
Pending array tasks are combined into a single line by the squeue command; use option -r to expand this list.
» squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
47_[24-100] main test jdow PD 0:00 1 (Resources)
47_1 main test jdow R 0:08 1 lxb001
47_2 main test jdow R 0:08 1 lxb001
47_3 main test jdow R 0:08 1 lxb001
47_4 main test jdow R 0:08 1 lxb001
[…]
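A minimal sketch of an array job script that uses SLURM_ARRAY_TASK_ID to pick a per-task input file (the input naming scheme and the process program are hypothetical):
#!/bin/bash
#SBATCH -J array-test
#SBATCH -D /lustre/nyx/hpc/jdow
#SBATCH -o slurm-%A_%a.out.log
# Each task selects its own input file by array index
INPUT=input_${SLURM_ARRAY_TASK_ID}.dat
./process $INPUT
Submitted with e.g. sbatch --array=1-5 array_test.sh, this starts five tasks, each reading a different input file.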
Parallel Applications
You may want to load an appropriate version of Open MPI from CVMFS using EnvironmentModules, i.e.:
>>> source /etc/profile.d/modules.sh
>>> module use /cvmfs/it.gsi.de/modulefiles/
>>> module load openmpi/gcc/1.10.0
>>> which mpicc mpirun
/cvmfs/it.gsi.de/openmpi/gcc/1.10.0/bin/mpicc
/cvmfs/it.gsi.de/openmpi/gcc/1.10.0/bin/mpirun
Compile a simple MPI program hello_world.c with mpicc and execute it with mpirun:
>>> mpicc -o hello_world hello_world.c
>>> mpirun -np 4 hello_world
Interactive
Execute the same program on the cluster by allocating resources with salloc:
>>> salloc -p debug -N2 -n40 bash
salloc: Granted job allocation 162
$ mpirun hello_world
Hello world lxb1193.2918 [15/40]
Hello world lxb1194.64133 [35/40]
Hello world lxb1193.2905 [7/40]
Hello world lxb1193.2908 [10/40]
Hello world lxb1193.2898 [0/40]
Hello world lxb1194.64151 [52/40]
Hello world lxb1193.2919 [16/40]
Hello world lxb1194.64150 [51/40]
Hello world lxb1193.2934 [22/40]
Hello world lxb1193.2907 [9/40]
[…]
$ exit
salloc: Relinquishing job allocation 162
Note that you cannot use the srun command here; it is not usable with MPI programs!
Batch
Simple batch script used to start an MPI application:
#!/bin/bash
#SBATCH -D /lustre/nyx/hpc/jdow
#SBATCH -o %j_%N.out.log
#SBATCH -e %j_%N.err.log
# Resource requirements for parallel execution
#SBATCH -N 2
#SBATCH -n 40
#SBATCH -p debug
# Load the required Open MPI environment
source /etc/profile.d/modules.sh
module use /cvmfs/it.gsi.de/modulefiles/
module load openmpi/gcc/1.10.0
# Execute the application
mpirun hello_world
Job Management
Common Job Options
Slurm commands print a summary of supported options with --usage:
» sinfo --usage
Usage: sinfo [-abdelNRrsv] [-i seconds] [-t states] [-p partition] [-n nodes]
[-S fields] [-o format]
Job meta data, output streams, mail hooks, etc.:
| Option | Description |
| -o, --output | File used to store all normal output (stdout). Use %j (job ID) and %N (name of first node) to automatically adapt the file name to the job; by default stdout goes to "stdout-%j.out". |
| -e, --error | File used to store all error output (stderr). (See above.) |
| --job-name | Name of the job (24 characters max.). |
| --comment | Attach a comment to the job. |
| --mail-type | Events triggering a mail (BEGIN, END, FAIL). |
| --mail-user | Recipient email address (a fully qualified address is required, e.g. j.doe@gsi.de). |
| --begin | Delay the start of a job, e.g. 16:00, now+1hour, or 2014-01-02T13:15:00. |
Resource allocation options:
| Option | Description |
| -M, --clusters | Cluster name (one or many). |
| -p, --partition | Partition name (one or many). |
| --mem | Memory per node, in MB. |
| --mem-per-cpu | Memory per CPU, in MB. |
| --gres | Generic resources (e.g. GPUs). |
| --licenses | Software licenses. |
Options to set restrictions and/or constraints:
| Option | Description |
| --constraint | Choose a feature (e.g. Xeon). |
| --mincpus | Minimum number of CPUs per node. |
| --tmp | Minimum amount of temporary disk space per node. |
| -d, --dependency=after(ok,notok,any):jobid | Specify job ordering. |
| --reservation | Allocate resources from a named reservation. |
| --share | Allow the allocated nodes to be shared with other jobs. |
| --contiguous | Require a contiguous range of nodes. |
| --geometry | |
| -F, --nodefile | Request the nodes listed in a file. |
| -w, --nodelist | Restrict jobs to a set of nodes (comma separated list). |
| -x, --exclude | Exclude nodes from a job (comma separated list). |
| --switches | Maximum count of switches desired for the job allocation. |
Requeue
Jobs get automatically requeued if compute nodes fail during execution. It is possible to alter this behavior using the following options at job submission:
| Option | Description |
| --requeue | Default; automatically requeue the job after node failure. |
| --no-requeue | Prevent the job from being requeued. |
The requeue configuration flag (1 = true) shows this behavior for each job:
» scontrol show job 2415 | grep Requeue
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
» scontrol update jobid=2415 requeue=0
Set requeue to zero to disable this behavior for a running job.
Monitoring Jobs
Watch the scheduling queue with the squeue command. By default the output is limited to jobs belonging to your own user account. Typically the output contains the job identifier, which is used with other commands to interact with a running job.
» squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
139 main stress jdow PD 0:00 1 (Resources)
140 main stress jdow PD 0:00 1 (Priority)
138 main stress jdow R 0:03 2 lxb[1194,1196]
137 main stress jdow R 0:11 2 lxb[1193-1194]
135 main stress jdow R 0:12 1 lxb1196
136 main stress jdow R 0:12 1 lxb1196
134 main stress jdow R 0:13 1 lxb1194
132 main stress jdow R 0:14 1 lxb1193
133 main stress jdow R 0:14 1 lxb1194
131 main stress jdow R 0:22 1 lxb1193
Option -t state limits the list to jobs in a certain state. Job states (ST) in the output of squeue include the following:
| State | Code | Meaning |
| PENDING | PD | Job is awaiting resource allocation. |
| RUNNING | R | Job currently has an allocation. |
| SUSPENDED | S | Job has an allocation, but execution has been suspended. |
| COMPLETING | CG | Job is in the process of completing. Some processes on some nodes may still be active. |
| COMPLETED | CD | Job has terminated all processes on all nodes. |
| CONFIGURING | CF | Job has been allocated resources, but is waiting for them to become ready. |
| CANCELLED | CA | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| FAILED | F | Job terminated with a non-zero exit code or other failure condition. |
| TIMEOUT | TO | Job terminated upon reaching its time limit. |
| PREEMPTED | PR | Job has been suspended by a higher priority job on the same resource. |
| NODE_FAIL | NF | Job terminated due to failure of one or more allocated nodes. |
Detailed information about job parameters can be displayed with the scontrol command.
» scontrol show job 145
JobId=145 Name=stress
UserId=jdow(3535) GroupId=hpc(1082)
Priority=2 Account=hpc QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=00:02:02 TimeLimit=UNLIMITED TimeMin=N/A
SubmitTime=2014-03-26T10:26:29 EligibleTime=2014-03-26T10:26:29
StartTime=2014-03-26T10:26:29 EndTime=Unknown
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=main AllocNode:Sid=lxdv111:10136
ReqNodeList=(null) ExcNodeList=(null)
NodeList=lxb1193
BatchHost=lxb1193
NumNodes=1 NumCPUs=12 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryCPU=2G MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/hera/hpc/jdow/tests/stress.sh 600s 12 500M
WorkDir=/hera/hpc/jdow/tests
Why does my job not start?
Resources
This just means that your job has the necessary priority, but all resources requested by the job are already allocated at this time.
Priority
All other jobs in the queue before your job have a higher priority.
Dependency
The job is waiting for another job to finish.
scontrol show job JobID | grep JobState
JobState=PENDING Reason=Dependency Dependency=afterany:anotherJobID
BeginTime
A start time in the future has been set for the job.
PartitionTimeLimit
The selected partition cannot fulfill the requested run time. Please select another partition for the job:
scontrol update job=JobID partition=....
JobHeldUser
Either you, your account coordinator, one of the admins, or the system has suspended your job. In the latter case this means that the job had already been scheduled to the cluster, but ended with a problem and was requeued.
scontrol show job JobID | grep Restarts
Requeue=1 Restarts=1 BatchFlag=2 ExitCode=0:0
A JobHeldUser flag can be released by the associated user with
scontrol release JobID
JobHeldAdmin
An administrator has suspended your job, and it can be released only by an admin. Usually you will get a friendly mail after such an action.
For a full list of reasons please have a look at http://slurm.schedmd.com/squeue.html#lbAF (keep in mind that there could be differences, since that documentation refers to the latest version of Slurm).
Priorities & Shares
Several factors contribute to the calculation of job priorities, among them job size, partition, and fair-share. The fair-share factor is calculated based on a defined share for each group and for the users inside the groups (so even if you were lazy in the past, it could be that your jobs will not get a high priority). The user shares are shares inside their corresponding group. Fair-share considers historical use of the cluster resources to achieve a long-term balancing of resource shares. Historical accounting information has a decay half-life (currently 7 days) to reduce the long-term effect of past resource usage.
Display the priority calculation configuration of the cluster controller with scontrol:
» scontrol show config | grep ^Priority
PriorityDecayHalfLife = 7-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = 0
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightFairShare = 0
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 0
[…]
List pending jobs sorted by priority with squeue:
» squeue -o '%.7i %.9Q %.9P %.8j %.8u %.8T %.10M %.11l %.8D %.5C %R' -S '-p' --state=pending
[…]
Show the priority given to a job with squeue (or sprio):
» squeue -o %Q -j JOBID
[…]
» sprio -w
[…]
The sshare command lists the shares of associations.
» sshare
Account User Raw Shares Norm Shares Raw Usage Effectv Usage FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root 1.000000 288990 1.000000 0.500000
alice 30 0.297030 0 0.000000 1.000000
cbm 20 0.198020 0 0.000000 1.000000
hades 20 0.198020 0 0.000000 1.000000
hpc 10 0.099010 288990 1.000000 0.000911
hpc jdow parent 0.099010 288990 1.000000 0.000911
panda 20 0.198020 0 0.000000 1.000000
You can lower the priority of your own jobs (and, as an account coordinator, also jobs from the same group) if you have important and not-so-important jobs at the same time:
» scontrol update job=JobID nice=yyy
Accounting
The command sacct reports resource usage for running or terminated jobs, including individual tasks, which can be useful to detect load imbalance between tasks. View a summary with option -b:
» sacct -b | tail -20
28 COMPLETED 0:0
29 FAILED 127:0
30 FAILED 127:0
31 COMPLETED 0:0
32 COMPLETED 0:0
33 COMPLETED 0:0
34 COMPLETED 0:0
35 COMPLETED 0:0
36 COMPLETED 0:0
37 CANCELLED+ 0:0
38 FAILED 127:0
39 FAILED 127:0
40 COMPLETED 0:0
41 COMPLETED 0:0
42 COMPLETED 0:0
43 COMPLETED 0:0
44 FAILED 130:0
45 COMPLETED 0:0
46 COMPLETED 0:0
46.batch COMPLETED 0:0
The output is customizable with option -o (list available fields with -e).
» sacct -j 46.batch -o 'JobID,NodeList,NCPUS,CPUTime,MaxRSS,'
JobID NodeList NCPUS CPUTime MaxRSS
------------ --------------- ---------- ---------- ----------
46.batch lxb1197 1 00:00:30 1964K
» sacct --format "JobID%3,User%10,CPUTime%8,NodeList"
Job User CPUTime NodeList
--- ---------- -------- ---------------
2 jdow 00:00:00 lxb007
3 jdow 00:00:00 lxb[001-004]
4 jdow 00:00:04 lxb[001-004]
5 jdow 00:00:40 lxb[001-004]
6 jdow 00:00:15 lxb[001-003]
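sacct can also report on a specific time window; a hedged example using -S/-E for start and end time:
» sacct -S 2014-05-01 -E 2014-05-22 -o 'JobID,JobName,State,Elapsed'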
Account Coordinators
What's an Account Coordinator?
Account coordinators organize the cluster usage for a specific group (experiment or department).
The account coordinator can:
- create Kronos accounts for users in their group
- distribute group shares over the users
- modify/suspend/delete jobs of all users in their group
There can be more than one account coordinator per group, and this is not necessarily a lifetime job.
Becoming an Account Coordinator for your group
If you think you are the ideal person to be the account coordinator for your group/department/experiment, ask us and we will have a look.
If you are already an Account Coordinator
Usually there are two things to do in the beginning, e.g.:
- sacctmgr add user alice account=atWonderland
- where 'alice' is the Linux user name and 'atWonderland' is the Kronos group the user should belong/be accounted to (usually equal to your own group)
- sacctmgr modify user alice where account=atWonderland set GrpCPUs=1024 GrpJobs=1000 GrpSubmit=5000
- some changes to restrict the user a little: at most 1024 CPUs, 1000 running jobs, and 5000 jobs in all states together (R+CG+PD)
- these values are our suggested defaults; please raise them only after you have had a look at the user's code/jobs/productions
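To review the current limits and shares, coordinators can list the associations of their account; a hedged example (the exact output columns depend on the Slurm version):
» sacctmgr show associations where account=atWonderland format=Account,User,GrpCPUs,GrpJobs,GrpSubmit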