- 12 Apr 2011
ATTENTION : In SGE the help for a single command is shown by -help, NOT by -h NOR --help !
The easiest way to submit a job is
qsub script.sh
Unlike in the LSF cluster we have NO standard batch queue in the SGE cluster; SGE matches jobs anew every time a slot is freed in the cluster.
SGE will only copy the script given on the qsub command line, nothing else!
If you want to execute a binary (or a file which is already on the target location) use the -b(inary) y(es) option.
qsub -b y /bin/date
If you don't define a path for the error and output files they will be written to the current working directory on the execution host, which is usually /tmp.
In these cases you have to take care of retrieving these files yourself; unlike in LSF,
you have to specify explicitly if you want to get the output via mail.
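A hedged sketch of steering the output location from inside the job script instead; the log directory and the use of -cwd are illustrative assumptions, not site policy:

```shell
#!/bin/sh
# Lines starting with "#$" are read by qsub as options, but are plain
# comments to the shell. The paths below are illustrative assumptions.
#$ -o $HOME/sge_logs/$JOB_ID.out   # stdout goes here, not to the remote cwd
#$ -e $HOME/sge_logs/$JOB_ID.err   # stderr goes here
#$ -cwd                            # run in the submission directory instead
msg="job ran on $(hostname)"
echo "$msg"
```

Because the directives are comments to the shell, the script can still be run by hand for a dry test.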
Defining resource requirements will help SGE to put your jobs into the queues which fit your jobs best.
If these queues are filled up SGE will put your jobs into queues with higher resource limits :
qsub -l h_vmem=3G,h_rt=02:00:00 script.sh
will place your jobs in the short queue; if this queue is full your jobs will go to the default queue and so on.
As these numbers are not only requirements for SGE but also limits, SGE will kill jobs which exceed them.
So one should be conservative with the resource requests to get fast scheduling, but not too conservative, otherwise the jobs are killed quickly.
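The same requests can also be embedded in the job script itself; a minimal sketch, using the same illustrative 3G / two-hour numbers:

```shell
#!/bin/sh
# "#$" lines are read by qsub but ignored by the shell; the limit values
# are the same illustrative numbers as in the qsub -l example above.
#$ -l h_vmem=3G       # hard memory limit: SGE kills the job above 3G
#$ -l h_rt=02:00:00   # hard runtime limit: two hours (HH:MM:SS)
limits="h_vmem=3G,h_rt=02:00:00"
echo "requested: $limits"
```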
If you want to get mails from SGE you can use
qsub -m X
where X is one or more of the following :
b send email at the beginning of the job
e send email at the end of the job
a send email when the job is rescheduled or aborted (killed)
s send email when the job is suspended
n do not send a mail, this is the default behaviour.
qsub -m be script.sh
will send you a mail at the start and another mail at the end of your job.
Submit job arrays
We strongly ask the users to use job arrays for large productions, as this means less work for the SGE scheduler and it is easier to monitor, for you and for us.
qsub -t 1-100 -b y /bin/sleep 1000
submits one job with 100 tasks
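Inside each task SGE sets the environment variable SGE_TASK_ID, which the script can use to pick its own piece of work. A sketch, where the input file naming scheme is an illustrative assumption:

```shell
#!/bin/sh
#$ -t 1-100
# SGE sets SGE_TASK_ID to 1..100, one value per task; the default below
# only makes the script runnable outside SGE for a dry run.
: "${SGE_TASK_ID:=1}"
input="input_${SGE_TASK_ID}.dat"   # illustrative naming scheme
echo "task ${SGE_TASK_ID} processes ${input}"
```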
qstat -j jobID
will then show :
submission_time: Mon Oct 31 13:18:56 2011
job-array tasks: 1-100:1
usage 2: cpu=00:00:00, mem=0.00002 GBs, io=0.00002, vmem=3.773M, maxvmem=3.773M
usage 5: cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
usage 7: cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
usage 8: cpu=00:00:00, mem=0.00000 GBs, io=0.00000, vmem=N/A, maxvmem=N/A
usage 9: cpu=00:00:00, mem=0.00002 GBs, io=0.00002, vmem=3.773M, maxvmem=3.773M
so it's much easier to have a look at the running tasks and to compare their resource usage. Finished tasks are not shown here.
Submit Dependency Jobs
A dependency job is a job which waits for another job to be finished.
E.g. you submit a simulation in the first job and an analysis job to analyse the simulated data in the second job.
With dependency jobs you can submit both jobs directly one after the other instead of waiting for the first job to be done.
qsub script.sh
Your job 336394 ("script.sh") has been submitted
qsub -hold_jid 336394 anotherscript.sh
Using qstat you will see that your second job is in the state hqw (hold queue waiting) until the first job is done.
Unlike in LSF, there is no possibility to wait only for a defined number of jobs or only for jobs in DONE or EXIT state; all jobs just have to be finished in some way.
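Chaining can be done without copying job IDs by hand: qsub -terse prints only the job ID, which can be captured and fed to -hold_jid. A small sketch; the script names are illustrative:

```shell
# Submit two jobs so that the second waits for the first to finish.
# "qsub -terse" prints just the job ID of the submitted job.
submit_chain() {
    sim_id=$(qsub -terse "$1")       # first job, e.g. the simulation
    qsub -hold_jid "$sim_id" "$2"    # second job waits for the first
}
# usage: submit_chain simulation.sh analysis.sh
```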
To kill a job just do
qdel 4711
qdel -u $(/bin/whoami)
deletes all of your jobs, and
qdel "blah*"
deletes all (your) jobs whose name matches blah*
Using the -f option will remove the job from SGE even if the job/processes can't be killed at the current moment, e.g. if the SGE execd or shepherd is unavailable. After restarting the SGE service on the target node the job will be killed.
Used to suspend jobs; use qalter or, better, qmod instead.
To modify a job use
qalter [options] 4711
Some options can even be altered for running jobs; unfortunately it is not possible to modify the queue of a running job.
If you have a job in the state Eqw (error queued waiting) you can do :
qmod -cj JobID
to clear the error flag. The job will then be rescheduled.
To suspend jobs use
qmod -sj jobID
qmod -usj jobID
to unsuspend jobs.
qstat
will show you all your jobs in their different states.
Jobs in different states can be filtered with
qstat -s p|r|h|d
Unlike in LSF, it is not possible to show all jobs (including DONE/EXIT ones) with one command.
qstat -s z
shows finished ("zombie") jobs, but displays "qw" as their job status. This is wrong, the job's status is DONE (or EXIT)!
WARNING : There is a small time gap in which jobs are done but not visible via qacct.
With
qstat -g c
you will get an overview of the cluster : it shows running jobs and, in addition, the remaining slots, slots being unusable for different reasons, and the queue load.
It will NOT show restrictions on resources, e.g. slots for a queue/user restricted by Resource Quotas.
qhost
shows all nodes with their vital values
qhost -l tmp_free
shows additional resource values, here the free tmp space
qhost -q -j
shows all nodes with their queue instances and the jobs running inside
WARNING : As the accounting files are currently NOT available on nodes other than the master, one CAN'T use the qacct command there! (at the moment...)
qacct can be used either to get information about finished jobs or to retrieve information about used resources.
qacct -j 4711
will show a very lengthy report about the job; there are no options like short, full or long output.
qacct -o -d 100
shows a nice table, sorted by user, of the resources used in the last 100 days.
CPU is measured in CPU seconds, MEMORY in GB seconds, IO in GB (both disk and network) and IOW in seconds.
REMARK : Unlike qstat, which takes -u (for user), qacct needs the option -o (for owner).
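As a small illustration of the MEMORY unit: a job holding 2 GB resident for one hour accrues 2 * 3600 = 7200 GB seconds.

```shell
# Illustrative arithmetic only: the MEMORY column integrates memory over
# runtime, so GB held times seconds running gives GB seconds (GBs).
mem_gb=2
runtime_s=3600                 # one hour
gbs=$((mem_gb * runtime_s))
echo "${gbs} GBs"              # prints: 7200 GBs
```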
Getting general information
If you are interested in the cluster configuration, e.g. the resource limits of a queue, you can use the qconf command.
qconf -sconf
shows the central cluster configuration
qconf -sq default | grep h_rss
shows you the real memory limit (h_rss) for the default queue; -sq is short for s(how)q(ueue).