LSF: Load Sharing Facility
Load Sharing Facility (LSF) is a cluster monitoring and queuing system. Interactive as well as batch jobs are distributed throughout the cluster by an intelligent load balancing system. A master host provides scheduling, priority queuing and load balancing for the submission and execution hosts.
- Monitoring of the cluster's load
- Starting interactive or batch jobs on the least loaded host
- Automatic login to the least loaded cluster node
- Parallel processing (PVM or MPI)
- GUI and command line interface
LSF comes with efficient error handling for the following situations:
- The master host breaks down: another master candidate host automatically takes over as master.
- No master candidate is available: no LSF operation is possible.
- A batch node breaks down: jobs running on it are lost, unless they were submitted as rerunnable (bsub -r), in which case they are restarted automatically.
- The license server is not available: no new jobs can be dispatched.
xlsmon: GUI Monitor for LSF
The load of the LSF farm can be watched using xlsmon. It shows the resources used for load balancing, such as CPU, I/O, paging etc. for each node.
In addition, historical load information is available and can be restricted to single hosts.
lsload: Show the Current Load
lsload or lsmon show the same information as xlsmon on the command line:
HOST_NAME       status  r15s  r1m  r15m    ut   pg  ls  it  tmp   swp   mem
linux5.gsi.de   ok       0.0  0.0   0.0    0%  0.5   2  10  15G  128M  236M
linux4.gsi.de   ok       0.0  0.0   0.2    6%  4.2   6   2  14G  125M  213M
linux2.gsi.de   ok       2.0  2.0   1.7  100%  0.6   5   2  14G  127M  222M
linux1.gsi.de   ok       2.3  1.5   0.2  100%  2.9  10   0  13G  128M  186M
linux3.gsi.de   unavail
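The tabular output also lends itself to scripting. The following sketch picks the least loaded available host; the sample data mirrors the table above and is embedded via a here-document so the filter can be tried without a cluster (on a real system, pipe lsload itself into the filter):

```shell
#!/bin/sh
# Sketch: pick the least loaded available host from lsload output.
# The sample data below mirrors the table above; on a real cluster,
# replace `lsload_sample` with the `lsload` command itself.
lsload_sample() {
cat <<'EOF'
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
linux5.gsi.de ok 0.0 0.0 0.0 0% 0.5 2 10 15G 128M 236M
linux4.gsi.de ok 0.0 0.0 0.2 6% 4.2 6 2 14G 125M 213M
linux2.gsi.de ok 2.0 2.0 1.7 100% 0.6 5 2 14G 127M 222M
linux1.gsi.de ok 2.3 1.5 0.2 100% 2.9 10 0 13G 128M 186M
linux3.gsi.de unavail
EOF
}

# Keep only hosts in state "ok", sort them (stably) by the
# 1-minute load average in column 4, and print the best host.
lsload_sample | awk '$2 == "ok"' | sort -s -n -k4,4 | awk 'NR == 1 { print $1 }'
```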
xlsbatch: GUI for LSF Batch
xlsbatch is a tool to submit batch jobs, to change the priority of jobs, to kill jobs, or to view the output of jobs during execution (Peek). In addition, information on job details and history can be retrieved.
The batch queues can be queried on the command line using bqueues.
xlsbatch or xbsub: Submit a Job
xbsub is a sub menu of xlsbatch which can also be called directly. The desired options can be entered here; a further sub menu offers advanced settings concerning pre-exec commands, job dependencies etc.
Resource limits are soft limits, which can be set by the user. Hard limits for these resources can be set in the definition of the queues. At present, this is only done for the CPU limit of the queues short and quick.
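On the command line, the same soft limits can be passed to bsub directly. A minimal sketch (the limit values are illustrative only; -c takes CPU time in minutes, -M memory in KB):

```shell
# Request soft limits for one's own job (values are examples):
#   -c 10      CPU time limit of 10 minutes
#   -M 100000  memory limit of ca. 100 MB (given in KB)
bsub -q batch -c 10 -M 100000 myjob
```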
xlsbatch or xbsub: Remote File Transfer
Remote File Transfer makes it possible to access data on the submission host without NFS, for example. This feature can be accessed via the Advanced button of the xlsbatch/xbsub menus:
bkill: Kill Jobs
bkill [ -h ] [ -V ] [ -l ] [ -s (signal_value | signal_name) ]
      [ -q queue_name ] [ -m host_name ] [ -u (user_name | all) ]
      [ -J job_name ] [ jobId | "jobId[index_list]" ... ]
-l
Display the set of signal names supported.
(This is a subset of those supported by /bin/kill and is
platform dependent).
-s value | name
Send the signal specified by value or name to the specified jobs.
Default: SIGKILL or 9.
-q queue
Send a signal to the jobs in the specified queue.
-m host
Send a signal to the jobs dispatched to the specified host or host group.
Ignored if a job ID other than 0 is specified.
-u user | all
Send a signal to the jobs submitted by the named user or by all users.
Ignored if a job ID other than 0 is specified.
-J jobname
Send a signal to the named jobs.
Ignored if a job ID other than 0 is specified.
jobId...
Send a signal to the specified jobs.
bkill 0 kills all jobs owned by oneself.
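Some typical invocations (the job ID 1234 is only an example):

```shell
bkill 1234              # kill job 1234 with SIGKILL (the default)
bkill -s SIGSTOP 1234   # suspend the job instead of killing it
bkill -q short 0        # kill all of one's own jobs in queue short
```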
bjobs: Display Job Details
bjobs [-l | -w] [-a] [-d] [-p] [-s] [-r] [-N spec] [-q queue]
[-m host|cluster] [-u user|all] [-J jobname] [-P project] [jobId...]
-l
Display information in a (long) multi-line format.
-w
Display the fields without truncation (wide format).
-a
Display information about all jobs, including unfinished jobs
(pending, running or suspended) and recently finished jobs.
-d
Display only recently finished jobs.
-p
Display only pending jobs, with the reasons they were not
dispatched during the last dispatch turn.
-s
Display only suspended jobs, showing the reason for suspension.
-r
Display only running jobs.
-N spec
Display CPU time consumed by the job.
The appropriate CPU scaling factor for the specified host,
or defined host model, is used to normalize the actual CPU time
consumed by the job.
-q queue
Display only jobs in the named queue.
-m hostname
Display only jobs dispatched to the named host or host group.
-u user|all
Display jobs submitted by the named user or by all users.
Default: display the jobs submitted by the user who invoked
this command.
-J jobname
Display all jobs with the specified name.
-P project
Display only jobs belonging to the named project.
Default: display jobs belonging to all projects.
jobId...
Display the job(s) with the specified job ID.
The value 0 is ignored.
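bjobs output is also easy to post-process. The sketch below extracts the IDs of pending jobs; the sample listing is invented for illustration and stands in for real bjobs output:

```shell
#!/bin/sh
# Sketch: extract the IDs of pending jobs from bjobs output.
# The sample listing below is made up; on a real cluster, pipe
# the output of `bjobs` into the awk filter instead.
bjobs_sample() {
cat <<'EOF'
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1234 lasi RUN batch lxi007 linux1 myjob Oct 16 10:00
1235 lasi PEND short lxi007 - myjob2 Oct 16 10:05
EOF
}

# Skip the header line, keep jobs whose STAT column is PEND,
# and print their JOBID.
bjobs_sample | awk 'NR > 1 && $3 == "PEND" { print $1 }'
```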
bhist: Display the History of Jobs
bhist [-b] [-l] [-w] [-a] [-d] [-p] [-s] [-r] [-f logfile] [-N spec]
[-C time0,time1] [-S time0,time1] [-D time0,time1] [-q queue] [-m host]
[-u user|all] [-J job] [-n number] [-P project] [jobId...]
-b
Display the job history in a brief format.
Default: display summary format.
-l
Display the job history in a (long) multi-line format,
giving detailed information about each job.
Default: display summary format.
-w
Display the job history in a (wide) multi-line format without
truncation. Default: display summary format.
-a
Display all, both finished and unfinished, jobs.
Default: finished jobs are not displayed.
-d
Display only the finished jobs.
-p
Display only the pending jobs.
-s
Display only the suspended jobs.
If option -l or -b is also specified, show the reason why
each job was suspended.
-r
Display only the running jobs.
-f logfile
Specify the file name of the event log file.
Either an absolute or a relative path name may be specified.
Default: use the system event log file lsb.events.
-N spec
Display CPU time consumed by the job.
The appropriate CPU scaling factor for the specified host,
or defined host model, is used to normalize the actual
CPU time consumed by the job.
-C time0,time1
Display only those jobs whose completion or exit times
were between time0 and time1.
Default: display all jobs that have completed or exited.
-S time0,time1
Display only those jobs whose submission times were between
time0 and time1.
Default: display all jobs that have been submitted.
-D time0,time1
Display only those jobs whose dispatch times were between
time0 and time1.
Default: display all jobs that have been dispatched.
-q queue
Display jobs submitted to the specified queue only.
Default: display all queues.
-m host
Display jobs dispatched to the specified host only.
Default: display all hosts.
-u user|all
Display jobs submitted by the named user or by all users.
Default: display the jobs submitted by the user who invoked this command.
-J jobname
Display all jobs with the specified name.
-P project
Display only jobs belonging to the named project.
Default: display jobs belonging to all projects.
jobId...
Display the specified job(s).
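For example, to list all jobs that completed within a time window (the dates are examples; LSF time intervals follow the year/month/day[/hour:minute] convention):

```shell
# All users' jobs that completed or exited between Oct 1 and Oct 16:
bhist -C 2006/10/1,2006/10/16 -u all
```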
bpeek: Display Standard Output and Error Output of Unfinished Jobs
bpeek [-f] [-q queue | -m host | -J jobname | jobId]
-f
Display the output of the job using the command "tail -f".
Default: use the command "cat".
-q queue
Display the output of the most recently submitted job in
the specified queue.
-m host
Display the output of the most recently submitted job that
has been dispatched to the specified host.
-J jobname
Display the output of the most recently submitted job that
has the given name.
jobId
Display the output of the specified job.
Default: display the output of the most recently submitted
job that satisfies options -q, -m, or -J.
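For example, to follow a running job's output continuously (the job ID is an example):

```shell
bpeek -f 1234    # follow the output of job 1234, like "tail -f"
```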
bsub: Command Line Submit
In many cases, the command line submission tool bsub is handier than the graphical tool xbsub, especially when using LSF scripts.
bsub [ -h ] [ -V ] [ -H ] [ -x ] [ -r ] [ -N ] [ -B ] [ -I | -Ip | -Is | -K ]
     [ -q queue_name ... ] [ -m "host_name[+[pref_level]] ..." ]
     [ -n min_proc[,max_proc] ] [ -R res_req ]
     [ -J job_name_spec ] [ -b begin_time ]
     [ -t term_time ] [ -i in_file ] [ -o out_file ]
     [ -e err_file ] [ -u mail_user ] [ [ -f "lfile op [rfile]" ] ... ]
     [ -E "pre_exec_command [ argument ... ]" ]
     [ -c cpu_limit[/host_spec] ] [ -F file_limit ]
     [ -W run_limit[/host_spec] ]
     [ -M mem_limit ] [ -D data_limit ] [ -S stack_limit ]
     [ -C core_limit ]
     [ -k "chkpnt_dir [ chkpnt_period ]" ]
     [ -w depend_cond ] [ -L login_shell ]
     [ -P project_name ] [ -G user_group ]
     [ command [ argument ... ] ]
Job scripts can be provided interactively or as files. They could take the form of the following examples:
# submit a job to queue short
bsub -q short
# change working directory
bsub > cd /u/lasi/lsf
# submit job myjob with parameter 1
bsub > myjob 1
# change the job name to TESTJOB
bsub > #BSUB -J TESTJOB
# submit the job
bsub > ^D
# submit a job to queue batch
bsub -q batch
# change the submit queue to test
bsub > #BSUB -q test
# send output to data/out and submit the job to a
# host with > 100 MB available memory
bsub > #BSUB -o data/out -R "mem > 100"
# submit job cpu with parameter 20
bsub > cpu 20
# change the job name to TEST1
bsub > #BSUB -J TEST1
# submit the job
bsub > ^D
bsub: Parameters
Job Dependencies: bsub -w depend_cond ...
Example: jobs will only be started once others have successfully completed:
bsub -w " FirstJob && SecJob " -J ThrdJob myjob
The job myjob is executed under the name ThrdJob after the jobs FirstJob and SecJob have finished.
-w started/done/exit/ended gives access to the job status. The default of -w is done.
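The status functions can also be spelled out explicitly in the dependency condition (the job names are examples):

```shell
bsub -w "done(FirstJob)"  myjob   # start after FirstJob completed successfully
bsub -w "exit(FirstJob)"  myjob   # start only if FirstJob failed
bsub -w "ended(FirstJob)" myjob   # start once FirstJob ended, regardless of status
```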
Host dependent jobs: bsub -m $LSB_HOSTS
Example: start a number of jobs on the host that LSF selected for a preceding job.
Command line:
bsub myjob 1
File myjob:
i=$1
i=`expr $i + 1`
while i=`expr $i - 1 `; do
bsub -q batch -m $LSB_HOSTS /u/lasi/c/cpu$i 20
done
All jobs started from within myjob will be executed on the host that LSF had selected for bsub myjob 1.
Preexecution Commands: bsub -E pre_exec_command
Example: check the /tmp space and, if necessary, clean up tmp files prior to execution of a program.
File myjob3:
set `df /tmp`
if [ ${11} -lt 10000000 ]; then
# free space below 10 GB?
date=`/bin/date`
uname=`/bin/uname -nm`
echo "Cleanup started on $uname at $date"
direct=/
expr 0 = `/bin/ls -1Aq ${direct}tmp | \
/usr/bin/grep -v "\`/bin/echo '[ \t]'\`" | \
/usr/bin/wc -l` > /dev/null
case $? in
1 )
ESC_direct=`echo "${direct}" | /bin/sed "s!/!\\\\\\\\/!g"`
# delete all regular files under /tmp
/bin/ls -1Aq ${direct}tmp | \
/usr/bin/grep -v "`/bin/echo '[ \t]'`" | \
/bin/sed "s/^/${ESC_direct}tmp\//" | \
/usr/bin/xargs -e /bin/sh -c '`/usr/bin/find ${0} ${@} \
-type f -exec rm {} 2> /dev/null \; `'
;;
esac
fi
Command Line: bsub myjob_preex
File myjob_preex:
bsub -E lsf/myjob3 -q batch -m $LSB_HOSTS c/cpu 20
The program c/cpu will be executed only if lsf/myjob3 has finished successfully.
Remote File Access: bsub -f "lfile op [rfile]"
If a file is not available on the execution host (e.g. due to NFS problems), it can be copied from the submission host to the execution host:
user@lxi007:/u/user> bsub -m linux1 -f "/tmp/mb >" myjob
copies the file /tmp/mb from the submission host to linux1 prior to execution of the program myjob.
Parallel Jobs: bsub -n min_proc,max_proc
LSF tries to provide max_proc CPUs for the calling job. The job is started only if min_proc CPUs can be allocated; after the start of the job, no further CPUs are added.
Using the job slot reservation feature, CPUs can be reserved for a defined amount of time. This increases the probability of getting the desired number of CPUs.
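A minimal sketch of a parallel submission (the queue name and CPU counts are examples):

```shell
# Ask for at least 4 and at most 8 CPUs for the parallel job:
bsub -q batch -n 4,8 myparjob
```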
bsub or lsrun: Run an Interactive Remote Job
LSF provides direct interaction with the execution host via command line or GUI. Even key combinations such as Ctrl-C behave as in local applications.
Accessing the interactive job facility via bsub adds the job pending time to the waiting time.
Interactive Jobs: bsub -I
bsub waits for the command to complete:
bsub -I -m linux3 dir /tmp/scratch
Output goes to the terminal.
The command lsrun is more efficient, enabling immediate execution.
Interactive Jobs: lsrun
lsrun -m linux3 dir /tmp/scratch
Interactive remote execution is also possible via a pseudo terminal:
lsrun -m linux3 -P vi /tmp/m1
or
bsub -Ip -m linux3 vi /tmp/m1
lsgrun: Run Interactive Parallel Jobs
The given job is run interactively on the specified hosts:
lsgrun -m "linux1 linux2 linux3" cat /tmp/m1 >> /tmp/m1
The files /tmp/m1 on the hosts linux1 to linux3 are concatenated into the local file /tmp/m1.
To delete these files, one can use lsgrun with the option -p (parallel):
lsgrun -m "linux1 linux2 linux3" -p rm /tmp/m1
German original of this documentation by:
--
Bärbel Lasitschka - Jun 1999
--
Christo - 06 Apr 2004
--
ThomasRoth - 16 Oct 2006