Grid Engine Usage
This guide will introduce you to the cluster computing facilities provided by GSI's HPC division. After following this document you should be able to run compute jobs on the cluster and handle log information and data output.
Introduction
The GridEngine clusters are accessible via a pool of login nodes. Login to Prometheus with pro.hpc.gsi.de. (Users requiring nodes with Infiniband to build their applications log in to pro-ib.hpc.gsi.de.) The login nodes provide all tools (compilers) to build software to be executed on the clusters. All nodes in Prometheus are running Debian 6 (Squeeze) 64Bit.
Storage (Lustre)
Each cluster uses dedicated shared storage (Lustre) available to all compute and login nodes. On the Prometheus cluster this shared storage is available in the directory /hera/ (~5PB). (Move data between /hera/ and /lustre/ using the login nodes of Prometheus.)
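For example, a transfer between the two file systems from a Prometheus login node could look like this (a sketch; the paths are hypothetical):
# Copy results from /hera/ to /lustre/ using a login node
rsync -av /hera/hpc/jdow/results/ /lustre/hpc/jdow/results/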
In order to gain write access to the shared file systems, contact the computing coordinator of your working group. (In case you are not associated with a particular working group, write a mail to hpc@gsi.de.)
Software repositories
Scientific software (libraries) is available in sub-directories called /cvmfs/*.gsi.de/ on all nodes of the clusters, cf. CVMFS. Every software repository consists of a directory named after a department or experiment followed by the GSI domain name "gsi.de". For example, the IT department supports /cvmfs/it.gsi.de, with software available for all users. Experiment groups have repositories like /cvmfs/hades.gsi.de or /cvmfs/alice.gsi.de.
Submit Your First Job
This paragraph will help you to run your first job. For this example we will use Prometheus. Login via SSH to pro.hpc and execute a couple of commands to gain an overview of the current state of the system:
» ssh pro.hpc
[...SNIP...]
jdow@lxsub08:~ » qstat -s r -u '*'
[...SNIP...]
jdow@lxsub08:~ » qhost
[...SNIP...]
The command qstat shows the status of all jobs in the cluster. By default it will limit its output to jobs belonging to the user executing the command. The option -u '*' will print jobs from all users, while the option -s r limits the output to jobs that are actually running. Similarly you can list all compute nodes belonging to a cluster with the command qhost.
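A couple of variations of these commands are often useful in practice (a sketch; the exact output depends on the cluster configuration):
# All running jobs of all users
qstat -s r -u '*'
# All pending jobs of all users
qstat -s p -u '*'
# Compute nodes together with their queues and the jobs running on them
qhost -q -j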
It is possible to submit an executable directly as a job into the cluster, but we recommend always using a wrapper script. It will help you tremendously to reproduce the job execution later, especially in case of problems. The following code provides a very simple example of an ordinary shell script printing some output and waiting for 60 seconds:
#$ -wd /tmp
#$ -j y
format='+%Y/%m/%d-%H:%M:%S'
# Identity of the compute node executing the job
echo [`date $format`] $USER@`hostname`
# Something to execute
sleep 60
# Finishing the job
echo [`date $format`] exit state $?
Anyone familiar with shell scripts will immediately understand it (with the exception of the comments prefixed with #$, which will be explained later). In order to run this script as a job in the cluster, create a file sleeper.sge in your personal directory on the shared storage. Since we run on the Prometheus cluster it will be located in a sub-directory of /hera/. In this example the user "jdow" stores files in the directory /hera/hpc/jdow/. If you follow this example you will need to adjust the path to the shared storage accordingly!
Submit a job to the cluster with the command qsub. The option -o /path/to/file.log specifies the location to write the job output to. The last parameter is the wrapper script to submit.
jdow@lxsub08:~ » qsub -o /hera/hpc/jdow/sleeper.log /hera/hpc/jdow/sleeper.sge
Your job 1534916 ("sleeper.sge") has been submitted
jdow@lxsub08:~ » qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
1534916 1.00000 sleeper.sg jdow r 06/27/2013 13:20:49 default@lxb858.gsi.de 1
jdow@lxsub08:~ » cat /hera/hpc/jdow/sleeper.log
[2013/06/27-13:20:49] jdow@lxb858
[2013/06/27-13:21:49] exit state 0
The job name is usually the name of the script/executable that you have submitted; you can set another name via 'qsub -N testjob...', but be aware that names starting with a number are not accepted. If your job has been accepted by the system, its job identification number will be printed.
Check the state of your job with qstat. As mentioned before, by default only your own jobs will be shown. The "state" column indicates the current state of your job. You will find a complete list of all possible states in the man page of qstat. For now, just two states are important to know: qw (queue wait) and r (running). The first indicates that the job is waiting to be executed as soon as suitable resources become available. Depending on the load of the cluster this may take a while. If your job disappears from the list it has been removed from the cluster, most likely because it has finished.
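For example, resubmitting the sleeper job under an explicit name and checking its state could look like this (the log path is the hypothetical one used above):
# Submit under the name "testjob" (names must not start with a number)
qsub -N testjob -o /hera/hpc/jdow/testjob.log /hera/hpc/jdow/sleeper.sge
# List your own jobs; the "state" column shows qw (waiting) or r (running)
qstat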
Track Your First Problem
Hopefully by now you have been able to successfully execute a job on the cluster. It is very likely that at some point in the future you will face the situation that your jobs break for whatever reason. One of the difficulties in these cases is to distinguish between problems internal to your application and issues caused by the execution environment. The following code is a wrapper script used to submit a job executing a program called segfaulter. This program will run for 20 seconds and break badly!
#$ -wd /tmp
#$ -j y
# Function to print log messages
_log() {
local format='+%Y/%m/%d-%H:%M:%S'
echo [`date $format`] "$@"
}
# Identity of the compute node executing the job
_log $JOB_ID Running on $USER@`hostname`:$PWD
# A faulty program
/hera/hpc/jdow/segfaulter &
# The process ID of the last spawned child process
child=$!
_log Spawn segfaulter with PID $child
# Wait for the child to finish
wait $child
# Exit status of the child process
state=$?
_log Finishing with $state
# Propagate the exit status to the batch system
exit $state
Let's take a closer look at this wrapper script. First it collects some information about the execution environment: the JOB_ID (which is defined by GridEngine), the user account name, the execution host name and the working directory. In a real application it would be necessary to check for all dependencies before launching the actual program, including input data, libraries and so on. Once the program starts, its process ID is logged, and upon finishing its exit status is recorded. If the job has failed during execution, this information will help to understand what happened.
jdow@lxsub08:~ » qsub -o /hera/hpc/jdow/segfaulter.log /hera/hpc/jdow/segfaulter.sge
Your job 1542056 ("segfaulter.sge") has been submitted
[...SNIP...]
jdow@lxsub08:~ » cat /hera/hpc/jdow/segfaulter.log
[2013/06/27-15:29:06] 1542056 Running on jdow@lxb869:/tmp
[2013/06/27-15:29:06] Spawn segfaulter with PID 7978
/var/spool/gridengine/execd/lxb869/job_scripts/1542056: line 16: 7978 Segmentation fault (core dumped) /hera/hpc/jdow/segfaulter
[2013/06/27-15:29:26] Finishing with 139
We can see the job identifier, when and where it was executed, what was executed (including its process ID), and how it died. That is the minimum amount of information required for decent debugging of problems.
Submitting Jobs
Users have extensive control over the job execution environment. Generally the command to submit (groups of) jobs is of the following form:
qsub [ OPTIONS ] [ EXECUTABLE | -- [ ARGUMENTS ]]
The EXECUTABLE is the path to the program you want to run, and you can pass optional ARGUMENTS (parameters) to your program. (If no executable is given, the qsub command puts the cursor on the following line and waits for manual commands from you. These commands, typed one by one, will constitute the job to be executed. Input is closed with Ctrl-D.)
A batch job typically consists of a shell script with a set of information concerning your job and the call to the main application program to execute. You can specify qsub options inside your script. These options should be given at the beginning of the script, before all other lines of your job. Each line containing an option starts with the prefix #$, followed by the option itself (#$ YOUR_OPTION). (You can change this prefix with the option -C PREFIX.)
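For illustration, the directives at the top of the earlier sleeper script, i.e.
#$ -wd /tmp
#$ -j y
are equivalent to passing the same options on the command line (the paths are the hypothetical ones used above):
qsub -wd /tmp -j y -o /hera/hpc/jdow/sleeper.log /hera/hpc/jdow/sleeper.sge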
Working Directory
As described above we do not support shared home directories on the cluster. Therefore you need to follow a couple of simple rules. First and foremost it is required to define your job's working directory to be located in /tmp/. One way to do that is to submit jobs with the parameter -cwd while your current working directory is in /tmp/. But it is better to define it with -wd /tmp, which allows you to submit your jobs from any directory.
Whatever output you produce while your job is running, you need to make sure to move it to the shared file-system (persistent storage) in the LustreFs if you want to keep it. Data only needed during the execution should be stored using the environment variable $TMPDIR. This variable is set by GridEngine for each job individually.
Scratch Space & Log-files
GridEngine redirects standard output and standard error to dedicated files. The path to both can be defined using the parameters -o PATH and -e PATH. Optionally you can join both output streams with the option -j y. Each job gets a dedicated directory serving as local scratch space, accessible by the path defined in $TMPDIR.
The following job merges standard output and error, and copies the log file from the execution host to the persistent storage at the end. It uses the environment variable $SGE_STDOUT_PATH holding the path to the log file.
#$ -wd /tmp
#$ -j y
# Path to persistent storage
_target=/lustre/<PATH>
# Generate random numbers on local storage
cat /dev/urandom | tr -dc '0-9' |\
fold -w 50 | head -n 50 > $TMPDIR/$JOB_ID.data
# Move the data to the persistent storage
cp -v $TMPDIR/* $_target
# Evacuate the log-file
cp -v $SGE_STDOUT_PATH $_target/$JOB_ID.log
The random numbers generated by this job are stored on the execution host while they are processed. The resulting file containing the output data is moved from the scratch space to the persistent storage before the job finishes. Notice that in order to execute this job you need to replace <PATH> with a directory in the LustreFs you have write access to.
This is just one of the possibilities for handling log and output data. Depending on the nature of your computing jobs you may want to write output immediately to the persistent storage on the LustreFs.
Resource Requirements
Instead of defining a specific queue when submitting jobs, you need to define resource requirements. Based on this information GridEngine will find the right execution node and will be able to utilize the cluster more efficiently.
$ qsub -l mem=3G,ct=03:00:00 ...
Define resource limits with the -l resource=value,... option to the qsub command. The example above defines a CPU-time limit ct of 3 hours and a maximum memory usage mem of 3GB.
You can verify whether the resource management system can fulfill your resource requirements using the option -w v. If the system cannot schedule your job it will answer with "verification: no suitable queues". If there are resources able to satisfy your requirements, the answer is "verification: found suitable queue(s)".
The execution time requested does not take into account the difference in performance between the different machines in the farm, so you should request CPU time with a sufficient safety margin. There are two formats of definition: either in seconds, or alternatively a string like hours:minutes:seconds.
-l ct=3600
-l ct=01:30:00
When defining memory limits you should always use a unit. Possible units are G, M, K and B.
If there are dedicated nodes that you do (or do not) want to use:
$ qsub -l hostname=lxb1000 ...          # will use ONLY lxb1000 for this job
$ qsub -l hostname='!(lxb1000)' ...     # will NOT use lxb1000 for this job
Please keep in mind that it is usually better to specify both kinds of resource requirements (CPU and/or run time AND real and/or virtual memory); otherwise it is likely that your job will go to the queue with the lowest resource limits which satisfies your requirements. So if you do qsub -l v_mem=5G ... the job will go to a queue with 5 GB of memory or more, but the runtime limit can be 1, 8 or 24 hours.
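A combined request might look like the following (a sketch using the resource names introduced above; paths and values are hypothetical):
# Request 3 hours of CPU time and 3 GB of memory in one submission
qsub -l ct=03:00:00,mem=3G -o /lustre/<PATH>/job.log /lustre/<PATH>/job.sge
# Check beforehand whether these requirements can be fulfilled at all
qsub -w v -l ct=03:00:00,mem=3G /lustre/<PATH>/job.sge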
Submitting Executables
Submit executable scripts (Perl, Python or Ruby) or binary files by adding the option -b y to the qsub command; otherwise the system will prevent you from executing non-shell-scripts. Furthermore you need to locate the binary/script in the LustreFs and define the absolute path to it when submitting.
qsub -b y /path/in/lustre/to/your/executable.py
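In practice you will usually combine this with output and resource options, for example (a sketch with hypothetical paths):
# Submit a Python script as a binary job with a log file and resource limits
qsub -b y -o /lustre/<PATH>/analysis.log -l ct=01:00:00,mem=2G /lustre/<PATH>/analysis.py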
Array Jobs
Instead of submitting each working task as a single job, it is possible to submit a single job spawning as many tasks as defined. The following example is the same random number generator as in the previous section, but with the capability to produce as many random number samples as the user specifies using the -t RANGE option:
#$ -wd /tmp
#$ -j y
# Name of this job
#$ -N data
# Number of sub-tasks to start
#$ -t 1-10
_target=/lustre/<PATH>
cat /dev/urandom | tr -dc '0-9' |\
fold -w 50 | head -n 50 > $TMPDIR/$JOB_ID.$SGE_TASK_ID.data
cp -v $TMPDIR/* $_target
cp -v $SGE_STDOUT_PATH $_target/$JOB_ID.$SGE_TASK_ID.log
The only difference between the individual executions is the environment variable $SGE_TASK_ID. Use this variable to distinguish task-dependent output files. Remember that GridEngine will start many tasks simultaneously, which means you cannot rely on anything provided by another task.
Limit the number of tasks with the option -tc NUMBER_OF_TASKS. It defines the maximum number of tasks which will be executed in parallel.
Running multiple jobs in parallel is an extremely common task on clusters. Generally it is possible to submit each job individually by writing your own loop around the job submission procedure. But take into account that each job consumes resources on the GridEngine master. When scheduling thousands of jobs this can be a significant burden, up to the point that the system will slow down or even crash. Therefore it is beneficial for the cluster and all users if multiple working tasks are submitted as an array job. Additionally it is easier to control your work, because the output of the batch system is easier to read when using job arrays instead of single jobs.
It is possible to remove a single task from an array job using qdel JOB_ID.SGE_TASK_ID, or the entire array using its job ID.
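The following commands illustrate this (the script path and job ID are hypothetical):
# Submit an array job with 100 tasks, of which at most 10 run concurrently
qsub -t 1-100 -tc 10 /lustre/<PATH>/data.sge
# Remove only task 7 of array job 1534916
qdel 1534916.7
# Remove the complete array job
qdel 1534916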
Job Dependencies
Maybe you noticed that we have added -N NAME to the example from the last paragraph. This option defines a name for an (array) job. It is possible to use this name as a reference for defining execution dependencies between jobs.
#$ -wd /tmp
#$ -j y
#$ -N merge
#$ -hold_jid data
# Path to persistent storage
_target=/lustre/<PATH>
# Merge all random numbers
for f in $_target/*.data
do
echo "Processing file $f "
cat $f | grep -v '^0' >> $TMPDIR/$JOB_ID.merge.data
done
cp -v $TMPDIR/* $_target
In this example the job waits for another job called "data" by defining the -hold_jid option.
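A possible submission sequence, assuming both wrapper scripts are stored on the shared file system (paths hypothetical):
# Submit the array job named "data" first
qsub /lustre/<PATH>/data.sge
# The "merge" job is accepted immediately but held until all "data" tasks have finished
qsub /lustre/<PATH>/merge.sge
# The dependency can also be given on the command line instead of in the script
qsub -N merge -hold_jid data /lustre/<PATH>/merge.sge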
Parallel Jobs
What are parallel jobs? Generally each job is assigned a single CPU. If an application is developed to utilize many CPUs within a single program, then the user needs to specify this resource requirement when submitting it as a job. GridEngine supports multiple so-called "parallel environments" used for different types of multi-core applications (use qconf -spl to see a list of the provided PEs).
If you want to execute a program based on shared memory access by multiple parallel threads on a single host (e.g. implemented with OpenMP), use the SMP parallel environment with the submit option -pe smp CORES. This will ensure that the number of CORES you have defined will be allocated on a single execution node.
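A minimal wrapper for such a job might look like this (a sketch; the program path is hypothetical, and it is assumed that your program honours OMP_NUM_THREADS):
#$ -wd /tmp
#$ -j y
#$ -N smp_test
#$ -pe smp 8
# GridEngine exports the number of granted slots in $NSLOTS
export OMP_NUM_THREADS=$NSLOTS
/lustre/<PATH>/my_openmp_program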
Applications based on a distributed-memory architecture communicating using MPI are submitted to the OpenMPI parallel environment with the option -pe openmpi CORES. They will be distributed among as many execution nodes as GridEngine identifies as suitable.
Get started with the following simple script that you can utilize to submit your MPI program. You will need to replace PATH_TO_SHARED_STORAGE with the path to your directory on the Lustre file system. For the purpose of this example we call this script mpi_submit.sge:
#!/bin/bash
#$ -j y
#$ -pe openmpi 50
#$ -N mpi_test
#$ -wd /tmp
_target=PATH_TO_SHARED_STORAGE
_exec=$1
format='+%Y/%m/%d-%H:%M:%S'
echo `date $format` Starting...
echo Hello from $USER@`hostname`:`pwd`
echo Starting MPI process
mpirun $_exec
echo See you!
echo `date $format` ..finished.
cp -v $SGE_STDOUT_PATH $_target/$JOB_ID.log
Assuming you are logged on to one of the submit nodes, compile your MPI program (using for example the mpic++ compiler) and verify that it executes properly (by using mpirun -np 4). This would look something like the following:
$ mpicc -o hello_world.o hello_world.c
$ mpirun -np 4 hello_world.o
Hello world from process (pid 32213) rank 1 (of 4 processes)
Hello world from process (pid 32212) rank 0 (of 4 processes)
Hello world from process (pid 32214) rank 2 (of 4 processes)
Hello world from process (pid 32215) rank 3 (of 4 processes)
$ cp hello_world.o /lustre/rz/jdoe/bin/
$ qsub mpi_submit.sge /lustre/rz/jdoe/bin/hello_world.o
If your program executes well, copy the compiled binary file to your shared storage directory (in the example above /lustre/rz/jdoe) and submit a test job using the mpi_submit.sge script followed by the path to your program executable.
Monitor the state of your job by adding the -g t option to qstat, which also shows the slave processes.
Monitoring Jobs & Troubleshooting
While jobs are in the batch system, users can access monitoring information using qstat. When no options are applied it will list your jobs and their state.
Possible job states:
- (t)ransferring
- (r)unning
- (R)estart
- (s)uspended
- (S)uspended by queue
- Suspended queue (T)hreshold reached
- (w)aiting for free resources
- (h)old execution until dependency is ready
- (e)rror
- (E)rror
- (q)ueued
Details about the job state can be displayed using the -j JOB_ID option. This will tell you why the job is pending and whether there are any reasons why queues cannot accept the job. Currently it is not possible to access the log-files from GridEngine, which means the qacct command won't work.
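For example, using the job ID printed at submission time (the job ID here is hypothetical):
# Show scheduling information, requested resources and possible error reasons
qstat -j 1534916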
Job Priorities
Before GridEngine starts to dispatch jobs, the jobs are brought into priority order, highest priority first. The system then attempts to find suitable resources for the jobs in priority sequence. Priorities are calculated from the share defined for your user group and the resources already consumed by your jobs in the past.
You can display the priority of your jobs in the system using the command:
qstat -ext -u '*'
Basic Debugging Process
Verify for yourself that the cluster and the batch system are happy before you do anything else. To that end, print an overview of the queues and their utilization:
qstat -g c
The last column indicates hosts in a problematic state. Get a list of the hosts running your jobs with:
qhost -q -j
Use the command qstat to ask for the resource consumption of a job:
qstat -j JOB_ID | grep usage
If these commands do not indicate any problems, verify that your application actually runs on one of the InteractiveMachines by executing it locally. The most common causes for a job running on the command line of our submit hosts but not on the cluster are:
- Differences in environment variables.
- Differences in the shell execution environment.
- Incorrect permission on input files.
- Use of shared file-systems only available on the submit host, such as /u/.
If a job has been in the running state and died for any reason, the best resources for debugging are the standard output/error log-files.
You can get an email when a job is aborted by adding the options -m a -M j.doe@gsi.de (send mail on abort to the given address).