Grid Engine Usage

TIP Prometheus/Hera decommissioning:

This guide will introduce you to the cluster computing facilities provided by GSIs HPC division. After following this document you should be able to run compute jobs on the cluster, handle log information and data output.



Introduction

The GridEngine clusters are accessible via a pool of login nodes. Login to Prometheus with pro.hpc.gsi.de (User requiring nodes with Infiniband to build their applications login to pro-ib.hpc.gsi.de.)

The login nodes provide all tools (compilers) to build software to be executed on the clusters. All nodes in Prometheus are running Debian 6 (Squeeze) 64Bit.

Storage (Lustre)

Each cluster uses dedicated shared storage ( Lustre) available to all compute and login nodes. On the Prometheus cluster a dedicated shared storage is available in the directory /hera/ (~5PB). (Move data between /hera/ and /lustre/ using the login nodes of Prometheus.)

In order to gain write access to the shared file systems contact the computing coordinator of your working group. (In case you are not associated to a particular working group write a mail to hpc@gsi.de.)

Software repositories

Scientific software (libraries) are available in sub-directories called /cvmfs/*.gsi.de/ on all nodes of the clusters, cf. CvmFs#Usage. Every software repository consists of a directory called like a department or experiment followed by the GSI domain name "gsi.de". For example, the IT Departments supports /cvmfs/it.gsi.de, with software available for all users. Experiment groups have repositories like /cvmfs/hades.gsi.de or /cvmfs/alice.gsi.de.

Submit Your First Job

This paragraph will help you to run your first job. For this example we will use Prometheus. Login via SSH to pro.hpc, and execute a couple of commands to gain an overview of the current state of the system:
» ssh pro.hpc
[...SNIP...]
jdow@lxsub08:~ » qstat -s r -u '*'
[...SNIP...]
jdow@lxsub08:~ » qhost
[...SNIP...]

The command =qstat= shows the status of all jobs in the cluster. By default it will limit its output to jobs belonging to the user executing the command. The option -u '*' will print jobs from all users, while the option -s r limits the output to actually running jobs. Similarly you can list all compute nodes belonging to a cluster with the command qhost.

It is possible to submit an executable directly as job into the cluster. But we recommend to use a wrapper script always. It will help you tremendously to reproduce the job execution later, especially in case of problems. The following code provides a very simple example of an ordinary shell script printing some output and waiting for 60 seconds:

#$ -wd /tmp 
#$ -j y

format='+%Y/%m/%d-%H:%M:%S'
# Identity of the compute node executing the job
echo [`date $format`] $USER@`hostname`
# Something to execute
sleep 60
# Finishing the job
echo [`date $format`] exit state $?

Anyone familiar with shell-scripts will immediately understand (with the exception of comments prefixed with #$ which will be explained later). In order to run this script as a job in the cluster create a file sleeper.sge in your personal directory on the shared storage. Since we run on the Prometheus cluster it will be located in a sub-directory of /hera/. In this example a user "jdow" stores files into a directory /hera/hpc/jdow/. If you follow this example you will need to adjust the path to the shared storage accordingly!

Submit a job to the cluster with the command qsub.
The options -o /path/to/file.log specifies the location to write the job output to. The last parameter is the wrapper-script to submit.
jdow@lxsub08:~ » qsub -o /hera/hpc/jdow/sleeper.log /hera/hpc/jdow/sleeper.sge
Your job 1534916 ("sleeper.sge") has been submitted
jdow@lxsub08:~ » qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
1534916 1.00000 sleeper.sg jdow         r     06/27/2013 13:20:49 default@lxb858.gsi.de              1 
jdow@lxsub08:~ » cat /hera/hpc/jdow/sleeper.log 
[2013/06/27-13:20:49] jdow@lxb858
[2013/06/27-13:21:49] exit state 0

The job name is usually the name of the script/executable that you have submitted, you can use an other name via 'qsub -N testjob...' , be aware that numbers at the begin of the name are not accepted. If your job has been accepted by the system its job identification number will be printed. Check the state of your job with qstat. As mentioned before by default only your jobs will be shown. The "state" column indicates the current state of your job. You will find a complete list of all possible states in the man page of qstat. For now, just two states are important to know qw (queue wait) and r (running). The first indicates that the job is waiting to be executed as soon as a suitable resources become available. Depending on the load of the cluster this may take a while. If your jobs disappears from the list it has been removed from the cluster, most likely because it was finished.

Track Your First Problem

Hopefully by now you have been able to successfully execute a job on the cluster. It is very likely that at some point in the future you will face the situation that your jobs break for whatever reason. One of the difficulties in these cases is to distinguish between problems internal to your application and issues caused by the execution environment. The following code is a wrapper script used to submit a job executing a program called segfaulter. This program will run for 20 seconds and break badly!

#$ -wd /tmp 
#$ -j y

# Function to print log messages
_log() {
  local format='+%Y/%m/%d-%H:%M:%S'
  echo [`date $format`] "$@"
}

# Identity of the compute node executing the job
_log $JOB_ID Running on $USER@`hostname`:$PWD
# A faulty program
/hera/hpc/jdow/segfaulter &
# The process ID of the last spawned child process
child=$!
_log Spawn segfaulter with PID $child
# Wait for the child to finish
wait $child
# Exit signal of the child process 
state=$?

_log Finishing with $state
# Propagate last signal to the system
exit $state 

Let's take a closer look to this wrapper script. First it collects some information about the execution environment, the JOB_ID (which is defined by GridEngine), user account name, execution host name and working directory. In a real application it would be necessary to check for all dependencies before launching the actual program, including input data, libraries and so on. Once the program starts the process ID is logged and upon finishing its exit signal is recorded. If the job has failed during execution, this information will help to understand what happened.
jdow@lxsub08:~ » qsub -o /hera/hpc/jdow/segfaulter.log /hera/hpc/jdow/segfaulter.sge 
Your job 1542056 ("segfaulter.sge") has been submitted
[...SNIP...]
jdow@lxsub08:~ » cat /hera/hpc/jdow/segfaulter.log 
[2013/06/27-15:29:06] 1542056 Running on jdow@lxb869:/tmp
[2013/06/27-15:29:06] Spawn segfaulter with PID 7978
/var/spool/gridengine/execd/lxb869/job_scripts/1542056: line 16:  7978 Segmentation fault      (core dumped) /hera/hpc/jdow/segfaulter
[2013/06/27-15:29:26] Finishing with 139

We can see the job identifier, when and where it was executed, what was executed (including its process ID), and how it died. That is the minimum amount of information required for decent debugging of problems.

Submitting Jobs

Users have extensive control about the job execution environment. Generally the command to submit (groups of) jobs is of the following form:
qsub [ OPTIONS ] [ EXECUTABLE | -- [ ARGUMENTS ]]

The EXECUTABLE is the path to the program you want to run, and you can pass optional ARGUMENTS (parameters) to your program. (If no executable is given the qsub command puts the cursor on the following line and waits for manual commands from you. These commands, typed one by one, will constitute the job to be executed. Input commands are closed by Ctrl-D.)

A batch job typically consists of shell scripts with a set of information concerning your job and the call to the main application program to execute. You can specify qsub options inside your script. These options should be given at the beginning of the script, before all other lines of your job. Each line containing options should start with the prefix #$ YOUR_OPTION. (You can change this prefix with option -C PREFIX.)

Working Directory

As described above we do not support shared home-directories on the cluster. Therefore you need to follow a couple of simple rules. First and foremost it is required to define your job working directory to be located in /tmp/. One way to do that is by submitting jobs with the parameter -cwd when your current working directory is in /tmp/. But it is better to define it with -wd /tmp in order to allow you to submit your jobs from any directory.

Whatever output you produce while your job is running, you need to make sure to move it to the shared file-system (persistent storage) in the LustreFs if you want to keep it. Data only needed during the execution should be stored using the environment variable $TMPDIR. This variable is set by GridEngine for each job individually.

Scratch Space & Log-files

GridEngine redirects to standard error and output to dedicated files. The path to both can be defined using the parameters -o PATH and -e PATH. Optionally you can join both output streams with the option -j y. Each job gets a dedicated directory serving a local scratch space accessible by the path defined in $TMPDIR.

The following job merges standard output and error, and copies the log file from the execution host to the persistent storage at the end. It uses the environment variable $SGE_STDOUT_PATHholding the path to the log file.

#$ -wd /tmp 
#$ -j y
# Path to persistent storage
_target=/lustre/
# Generate random numbers on local storage
cat /dev/urandom | tr -dc '0-9' |\
  fold -w 50 | head -n 50 > $TMPDIR/$JOB_ID.data
# Move the data to the persistent storage
cp -v $TMPDIR/* $_target 
# Evacuate the log-file
cp -v $SGE_STDOUT_PATH $_target/$JOB_ID.log

The random numbers calculated by this job are stored on the execution host while they are processed. The resulting file containing the output data is moved from the scratch space to the persistent storage before the job finishes. Notice that in order to execute this job you need to replace <PATH> with an directory in LustreFs you have write access to.

This is just one of the possibilities how you can handle log and output data. Depending on the nature of our computing jobs you may want to write output immediately to the persistent storage on the LustreFs.

Resource Requirements

Instead of defining a specific queue when submitting jobs, you need to define resource requirements. According to this information GridEngine will find the right execution node, and will be able to utilize the cluster more efficiently.
$ qsub -l mem=3G,ct=03:00:00 ...

Define resource limits with the -l resource=value,.. option to the qsub command. The example above defines a CPU-time limit ct of 3 hours and a maximum memory usage mem of 3GB.

You can verify if the resource management system can fulfill your resource requirements using the option -w v. If the system cannot schedule your job it will answer with "verification: no suitable queues". In case there are existing resources to execute your requirements the answer is "verification: found suitable queue(s)".

The execution time expressed will not take into account the difference in terms of performance between the different machines in the farm. You should require CPU time with an sufficient safety margin. There are two formats of definition, either in seconds or alternatively a string like hours:minutes:seconds.
-l ct=3600
-l ct=01:30:00

When defining memory limits you always should use a unit. Possible units are G,M, K, and B.

If you have a hang on dedicated nodes that you (don't) want to use :
$ qsub -l hostname=lxb1000 ...        //will use ONLY lxb1000 for this job
$ qsub -l hostname=!(lxb1000) ... //will NOT use lxb1000 for this job

Please keep in mind that it is usually better to use both (CPU and/or RunTime AND real and/or virtual memory) resource requirements, otherwise it is likely that your job will go to the queue with the lowest resource limits which satisfies your requirements.

So if you do qsub -l v_mem=5G ... the job will go to one queue with equal or more than 5 GB of memory but the runtime can be 1 , 8 or 24 hours.

Submitting Executables

Submit executable scripts (Perl, Python or Ruby) or binary files by adding the option -b y to the qsub command, otherwise the system will prevent you from executing non shell-scripts. Furthermore you need to locate the binary/script in the LustreFs and define the absolute path to it when submitting.
qsub -b y /path/in/lustre/to/your/executable.py

Array Jobs

Instead of submitting each working task as a single job, it is possible to submit a single job spawning as many tasks as defined. The following example is the same random number generator as in the previous section, but with the capability to produce as many random number samples as the user specifies using the -t RANGEoption:

#$ -wd /tmp 
#$ -j y
# Name of this job
#$ -N data
# Number of sub-tasks to start
#$ -t 1-10
_target=/lustre/
cat /dev/urandom | tr -dc '0-9' |\
  fold -w 50 | head -n 50 > $TMPDIR/$JOB_ID.$SGE_TASK_ID.data
cp -v $TMPDIR/* $_target 
cp -v $SGE_STDOUT_PATH $_target/$JOB_ID.$SGE_TASK_ID.log

The only difference between each execution is an environment variable $SGE_TASK_ID. Use this variable to distinguish task dependent output files. Remember that GridEngine will start many tasks simultaneously, which means you cannot rely on anything provided by another task.

Limit the number of tasks with the option -tc NUMBER_OF_TASKS. It defines the maximum number of tasks which will be executed in parallel.

Running multiple jobs in parallel is a extremely common task on clusters. Generally it is possible to submit each job individually by writing your own loop around the job submission procedure. But, take into account that each job consumes resources on the GridEngine master. When scheduling thousands of jobs this can be a significant burden up to the point that the system will slow down or even crash. Therefore it is beneficial for the cluster and all users if multiple working tasks are submitted as an array job.

Additionally it is easier to control your work because the output of the batch system is better viewable if using job arrays instead of single jobs.

It is possible to remove a single task from an array job using qdel JOB_ID.SGE_TASK_ID, or the entire array using its job ID.

Job Dependencies

Maybe you noticed that we have added -N NAMEto the example from the last paragraph. This option defines a name for a (array) job. It is possible to use this name as a reference for defining execution dependencies between jobs.

#$ -wd /tmp 
#$ -j y
#$ -N merge
#$ -hold_jid data
# Path to persistent storage
_target=/lustre/
# Merge all random numbers
for f in $_target/*.data
do
  echo "Processing file $f "
  cat $f | grep -v '^0' >> $TMPDIR/$JOB_ID.merge.data
done
cp -v $TMPDIR/* $_target 

In this example the job waits for another job called "data" by defining the -hold_jid option.

Parallel Jobs

What are parallel jobs? Generally each job is assigned a single CPU. If an application is developed to utilize many CPUs within a single program, then the user needs to specify this resource requirement when submitting it as a job. GridEngine supports multiple so called "parallel environments" used for different types of multi-core applications (use qconf -spl to see a list of provided PEs).

If you want to execute a program based on shared memory access by multiple parallel threads on a single host (e.g. implemented with OpenMP), use the SMP parallel environment with the submit option -pe smp CORES. This will ensure that the number of CORES you have defined will be allocated on a single execution node.

Applications based on a distributed-memory architecture communicating using MPI are submitted to the OpenMPI parallel environment with the option -pe openmpi CORES. They will be distributed among as many execution nodes as GridEngine identifies as suitable.

Get started with the following simple script that you can utilize to submit your MPI program. You will need to replace PATH_TO_SHARED_STORAGE with the path to your directory on the Lustre file systems. For the purpose of this example we call this script mpi_submit.sge:

#!/bin/bash
#$ -j y
#$ -pe openmpi 50
#$ -N mpi_test
#$ -wd /tmp
_target=PATH_TO_SHARED_STORAGE
_exec=$1
format='+%Y/%m/%d-%H:%M:%S'
echo `date $format` Starting...
echo Hello from $USER@`hostname`:`pwd` 
echo Starting MPI process
mpirun $_exec
echo See you!
echo `date $format` ..finished. 
cp -v $SGE_STDOUT_PATH $_target/$JOB_ID.log

Assuming you are logged on to one of the submit nodes, compile your MPI program (using for example the mpic++ compiler) and verify that it executes properly (by using mpirun -np 4). This would look something like the following:
$ mpicc -o hello_world.o hello_world.c
$ mpirun -np 4 hello_world.o
Hello world from process (pid 32213) rank 1 (of 4 processes)
Hello world from process (pid 32212) rank 0 (of 4 processes)
Hello world from process (pid 32214) rank 2 (of 4 processes)
Hello world from process (pid 32215) rank 3 (of 4 processes)
$ cp hello_world.o /lustre/rz/jdoe/bin/
$ qsub mpi_submit.sge /lustre/rz/jdoe/bin/hello_world.o

If your program executes well copy the compiled binary file to your shared storage directory (in the example above /lustre/rz/jdoe) and submit a test job using the mpi_submit.sge script followed by the path to your program executable. Monitor the state of your job by adding the -g t option to _qstat_ to show slave processes also.

Monitoring Jobs & Troubleshooting

While jobs are in the batch system, users can access monitoring information using qstat. When no options are applied it will list your jobs and their state.

Possible jobs states:

  • (t)ransfering
  • (r)unning
  • (R)estart
  • (s)uspended
  • (S)uspended by queue
  • Suspended queue (T)hreshold reached
  • (w)aiting for free resources
  • (h)old execution until dependency is ready
  • (e)rror
  • (E)rror
  • (q)ueued

Details about the job state can be displayed using the -j JOB_ID option. This will tell you why the job is pending and if there are any reasons why queues cannot accept the job. Currently it is not possible to have access to the log-files from GE, this means the qacct command won't work.

Job Priorities

Before the GridEngine starts to dispatch jobs, the jobs are brought into priority order, highest priority first. The system then attempts to find suitable resources for the jobs in priority sequence. Priorities are calculated by the share defined for your user group and the resources already consumed by your jobs in the past.

You can display the priority of your jobs in the system using the command:
qstat -ext -u '*'

Basic Debugging Process

Verify for yourself that the cluster and the batch system are happy before you do anything else. To that end, print an overview of the queues and their utilization:
qstat -g c

The last column indicates hosts in problematic state. Get a list of the hosts running your jobs with:
qhost -q -j

Use the command qstat to ask for the resource consumption of a job:
qstat -j JOB_ID | grep usage

If these commands do not indicate any problems, verify that your application actually runs on one of the InteractiveMachines, by executing it locally. Most common causes for a job running on the command-line of our submit-hosts but not on the cluster are:

  • Differences in environment variables.
  • Differences in the shell execution environment.
  • Incorrect permission on input files.
  • Use of shared file-systems only available on the submit-host, such as /u/.

If a job has been in the running state and died for any reason, the best resource for debugging is the standard output/error log-files.

You can get an email explaining why a job is aborted by adding the option -m j.doe@gsi.de.
Topic attachments
I Attachment Action Size Date Who Comment
GE2.pdfpdf GE2.pdf manage 83.8 K 2012-04-24 - 07:59 CarstenPreuss The presentation given on 8.3.2012
Topic revision: r40 - 2016-06-06, ChristopherHuhn