LSF Load Sharing Facility

Load Sharing Facility (LSF) is a cluster monitoring and queuing system. Interactive as well as batch jobs are distributed throughout the cluster by an intelligent load balancing system. A master host provides scheduling, priority queuing and load balancing for the submission and execution hosts.

  • Monitoring of the cluster's load
  • Starting interactive or batch jobs on the least loaded host

Cluster Node
  • Automatic login to least loaded cluster node
  • Parallel processing (PVM or MPI)
  • GUI and command line interface

LSF comes with efficient error handling, taking charge of the following situations:

  • Master breaks down: another master candidate host automatically takes over as master
  • No master candidate available: no LSF operation is possible
  • Batch node breaks down: running jobs on that node are lost; jobs submitted with bsub -r are restarted automatically
  • License server not available: no new jobs can be dispatched

xlsmon: GUI Monitor for LSF

The load of the LSF farm can be watched using xlsmon. xlsmon shows the resources used for load balancing, such as CPU, I/O, paging etc. for each node:

In addition, historical load information is available; the display can be restricted to single hosts:

lsload: Show the current Load

lsload or lsmon show the same information as xlsmon on the command line.

HOST_NAME      status  r15s  r1m  r15m    ut   pg  ls  it  tmp   swp   mem
linux5.gsi.de  ok       0.0  0.0   0.0    0%  0.5   2  10  15G  128M  236M
linux4.gsi.de  ok       0.0  0.0   0.2    6%  4.2   6   2  14G  125M  213M
linux2.gsi.de  ok       2.0  2.0   1.7  100%  0.6   5   2  14G  127M  222M
linux1.gsi.de  ok       2.3  1.5   0.2  100%  2.9  10   0  13G  128M  186M
linux3.gsi.de  unavail

xlsbatch: GUI for LSF Batch

xlsbatch is a tool to submit batch jobs, to change the priority of jobs, to kill jobs or to view the output of jobs during execution (Peek). In addition, information on job details and history can be retrieved.

bqueues: Queue Information

The batch queues can be queried on the command line using bqueues.
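
For instance, the summary and the long format can be combined like this (a sketch; the queue name short is taken from the examples below, and the calls are guarded so the snippet degrades to a message on hosts without LSF):

```shell
# List all batch queues, then show full details for the queue "short".
if command -v bqueues >/dev/null 2>&1; then
    bqueues            # one summary line per queue
    bqueues -l short   # long format: limits, hosts, scheduling parameters
else
    echo "bqueues not available on this host"
fi
```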

xlsbatch or xbsub: Submit a Job

xbsub is a submenu of xlsbatch which can also be called directly. The desired options can be entered here; a further submenu offers advanced settings concerning pre-execution commands, job dependencies etc.:

Resource limits are soft limits, which can be set by the user. Hard limits for these resources can be set in the definition of the queues. At present, this is only done for the CPU limit of the queues short and quick.
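
Soft limits are simply passed as bsub options, as in this sketch (the job name myjob and the limit values are made up; the call is guarded for hosts without LSF; -c takes minutes, -M kilobytes):

```shell
# Submit myjob with a soft CPU limit of 30 minutes (-c)
# and a soft memory limit of about 200 MB (-M, in KB).
if command -v bsub >/dev/null 2>&1; then
    bsub -q short -c 30 -M 200000 myjob
else
    echo "would submit: bsub -q short -c 30 -M 200000 myjob"
fi
```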

xlsbatch or xbsub: Remote File Transfer

Remote File Transfer makes it possible to access data on the submission host, e.g. without NFS. This feature can be accessed via the Advanced button of the xlsbatch/xbsub menus:

bkill: Kill Jobs

bkill [ -h ] [ -V ] [ -l ] [ -s ( signal_value | signal_name ) ] [ -q queue_name ] [ -m host_name ] [ -u ( user_name | all ) ] [ -J job_name ] [ jobId | "jobId[index_list]" ... ]

-l Display the set of signal names supported. (This is a subset of those supported by /bin/kill and is platform dependent).

-s value | name Send the signal specified by value or name to the specified jobs. Default: SIGKILL (9).
-q queue Send a signal to the jobs in the specified queue.
-m host Send a signal to the jobs dispatched to the specified host or host group. Ignored if a job ID other than 0 is specified.
-u user | all Send a signal to the jobs submitted by the named user or by all users. Ignored if a job ID other than 0 is specified.
-J jobname Send a signal to the named jobs. Ignored if a job ID other than 0 is specified.
jobId... Send a signal to the specified jobs.

bkill 0 kills all of the invoking user's jobs.

bjobs: Display Job Details

bjobs [-l | -w] [-a] [-d] [-p] [-s] [-r] [-N spec] [-q queue]
[-m host|cluster] [-u user|all] [-J jobname] [-P project] [jobId...]

-l
 Display information in a (long) multi-line format.
-w
 Display fields in a (wide) format without truncation.
-a
 Display information about all jobs, including unfinished jobs
 (pending, running or suspended) and recently finished jobs.
-d
 Display only recently finished jobs.
-p
 Display only pending jobs, with the reasons they were not
 dispatched during the last dispatch turn.
-s
 Display only suspended jobs, showing the reason for suspension.
-r
 Display only running jobs.
-N spec
 Display CPU time consumed by the job.
 The appropriate CPU scaling factor for the specified host,
 or defined host model, is used to normalize the actual CPU time
 consumed by the job.
-q queue
 Display only jobs in the named queue.
-m hostname
 Display only jobs dispatched to the named host or host group.
-u user|all
 Display jobs submitted by the named user or by all users.
 Default: display the jobs submitted by the user who invoked
 this command.
-J jobname
 Display all jobs with the specified name.
-P project
 Display only jobs belonging to the named project.
 Default: display jobs belonging to all projects.
jobId...
 Display the job(s) with the specified job ID.
 The value 0 is ignored.

bhist: Display Job History

bhist [-b] [-l] [-w] [-a] [-d] [-p] [-s] [-r] [-f logfile] [-N spec]
[-C time0,time1] [-S time0,time1] [-D time0,time1] [-q queue] [-m host]
[-u user|all] [-J job] [-n number] [-P project] [jobId...]

-b
 Display the job history in a brief format.
 Default: display summary format.
-l
 Display the job history in a (long) multi-line format,
 giving detailed information about each job.
 Default: display summary format.
-w
 Display the job history in a (wide) multi-line format without
 truncation. Default: display summary format.
-a
 Display all, both finished and unfinished, jobs.
 Default: finished jobs are not displayed.
-d
 Display only the finished jobs.
-p
 Display only the pending jobs.
-s
 Display only the suspended jobs.
 If option -l or -b is also specified, show the reason why
 each job was suspended.
-r
 Display only the running jobs.
-f logfile
 Specify the file name of the event log file.
 Either an absolute or a relative path name may be specified.
 Default: use the system event log file lsb.events.
-N spec
 Display CPU time consumed by the job.
 The appropriate CPU scaling factor for the specified host,
 or defined host model, is used to normalize the actual
 CPU time consumed by the job.
-C time0,time1
 Display only those jobs whose completion or exit times
 were between time0 and time1.
 Default: display all jobs that have completed or exited.
-S time0,time1
 Display only those jobs whose submission times were between
 time0 and time1.
 Default: display all jobs that have been submitted.
-D time0,time1
 Display only those jobs whose dispatch times were between
 time0 and time1.
 Default: display all jobs that have been dispatched.
-q queue
 Display jobs submitted to the specified queue only.
 Default: display all queues.
-m host
 Display jobs dispatched to the specified host only.
 Default: display all hosts.
-u user|all
 Display jobs submitted by the named user or by all users.
 Default: display the jobs submitted by the user who invoked this command.
-J jobname
 Display all jobs with the specified name.
-P project
 Display only jobs belonging to the named project.
 Default: display jobs belonging to all projects.
jobId...
 Display the specified job(s).

bpeek: Display Standard Output and Error Output of Unfinished Jobs

bpeek [-f] [-q queue | -m host | -J jobname | jobId]

-f
 Display the output of the job using the command "tail -f".
 Default: use the command "cat".
-q queue
 Display the output of the most recently submitted job in
 the specified queue.
-m host
 Display the output of the most recently submitted job that
 has been dispatched to the specified host.
-J jobname
 Display the output of the most recently submitted job that
 has the given name.
jobId
 Display the output of the specified job.
 Default: display the output of the most recently submitted
 job that satisfies options -q, -m, or -J.

bsub: Command Line Submit

In many cases, the command line submission tool bsub is more handy than the graphical tool xbsub, especially when using LSF scripts.

 bsub [ -h ] [ -V ] [ -H ] [ -x ] [ -r ] [ -N ] [ -B ] [ -I | -Ip |
 -Is | -K ]
 [ -q queue_name ... ] [ -m host_name[+[pref_level]] ... ]
 [ -n min_proc[,max_proc] ] [ -R res_req ]
 [ -J job_name_spec ] [ -b begin_time ]
 [ -t term_time ] [ -i in_file ] [ -o out_file ]
 [ -e err_file ] [ -u mail_user ] [ [ -f "lfile op [ rfile ]" ] ... ]
 [ -E "pre_exec_command [ argument ... ]" ]
 [ -c cpu_limit[/host_spec ] ] [ -F file_limit ]
 [ -W run_limit[/host_spec ] ]
 [ -M mem_limit ] [ -D data_limit ] [ -S stack_limit ]
 [ -C core_limit ]
 [ -k "chkpnt_dir [ chkpnt_period ]" ]
 [ -w depend_cond ] [ -L login_shell ]
 [ -P project_name ] [ -G user_group ]
 [ command [ argument ... ] ]

Job scripts can be provided interactively or as files. They could take the form of the following example:

# submit a Job in Queue Short

bsub -q short

# change working directory
bsub > cd /u/lasi/lsf
# submit Job myjob with Parameter 1
bsub > myjob 1
# change Jobname to TESTJOB
bsub > #BSUB -J TESTJOB
# submit the Job
bsub > ^D


# submit a Job to Queue batch

bsub -q batch

# change Submit Queue to test
bsub > #BSUB -q test
# send output to data/out and submit the Job to a
# host with > 100 MB available Memory
bsub > -o data/out -R "mem > 100"
# submit Job cpu with Parameter 20
bsub > cpu 20
# change Jobname to TEST1
bsub > #BSUB -J TEST1
# submit the Job
bsub > ^D

bsub: Parameters

Job Dependencies: bsub -w depend_cond ...

Example: jobs are only started once others have successfully completed:

bsub -w " FirstJob && SecJob " -J ThrdJob myjob

The job myjob is executed under the name ThrdJob after the jobs FirstJob and SecJob have finished.

-w started / done / exit / ended gives access to the job status. The default for -w is done.
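
The job states can also be named explicitly in the dependency condition, as in this guarded sketch (job names taken from the example above):

```shell
# Start ThrdJob only after FirstJob finished successfully (done) and
# SecJob has at least ended, successfully or not (ended).
if command -v bsub >/dev/null 2>&1; then
    bsub -w "done(FirstJob) && ended(SecJob)" -J ThrdJob myjob
else
    echo 'would submit with: -w "done(FirstJob) && ended(SecJob)"'
fi
```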

Host dependent jobs: bsub -m $LSB_HOSTS

Example: starting several jobs on the host selected for the preceding job

Command Line: bsub myjob 1

File myjob :
i=$1

i=`expr $i + 1`

while i=`expr $i - 1 `; do
   bsub -q batch -m $LSB_HOSTS /u/lasi/c/cpu$i 20
done

All jobs started from within myjob will be executed on the host that had been selected by LSF for bsub myjob 1.
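
The countdown logic of myjob can be tried without LSF by replacing the bsub call with echo; expr exits non-zero as soon as the counter reaches 0, which ends the while loop. Here the start value is fixed at 3 instead of being read from $1:

```shell
# Count down from 3 to 1, echoing what would be submitted for each i.
# Prints three lines, for i=3, 2 and 1.
i=3
i=`expr $i + 1`
while i=`expr $i - 1`; do
    echo "would submit: bsub -q batch -m \$LSB_HOSTS /u/lasi/c/cpu$i 20"
done
```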

Preexecution Commands: bsub -E pre_exec_command

Example: check the /tmp space and, if necessary, clean up tmp files prior to execution of a program

File myjob3:
set `df /tmp`

if [ ${11} -lt 10000000 ]; then
   # free space below 10 GB? (df reports KB)
   date=`/bin/date`
   uname=`/bin/uname -nm`
   echo "Cleanup started on $uname at $date"
   direct=/
   expr 0 = `/bin/ls -1Aq ${direct}tmp | \
      /usr/bin/grep -v "\`/bin/echo '[ \t]'\`" | \
      /usr/bin/wc -l` > /dev/null
   case $? in
      1 )
         ESC_direct=`echo "${direct}" | /bin/sed "s!/!\\\\\\\\/!g"`
         /bin/ls -1Aq ${direct}tmp | \
            /usr/bin/grep -v "`/bin/echo '[ \t]'`" | \
            /bin/sed "s/^/${ESC_direct}tmp\//" | \

            # Delete all /tmp Files

            /usr/bin/xargs -e /bin/sh -c '`/usr/bin/find ${0} ${@} \
            -type f -exec rm {} 2> /dev/null \; `' 
      ;;
   esac
fi

Command Line: bsub myjob_preex

File myjob_preex :

bsub -E lsf/myjob3 -q batch -m $LSB_HOSTS c/cpu 20

Program c/cpu will be executed only if lsf/myjob3 has finished successfully.
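
The positional-parameter trick in myjob3 can be sketched more simply: df -k prints sizes in KB, so the fourth field of the last output line is the available space, and 10000000 KB is roughly 10 GB. This version only reports; no files are touched:

```shell
# Check the available space on /tmp without deleting anything.
avail=`df -k /tmp | tail -1 | awk '{print $4}'`
if [ "$avail" -lt 10000000 ]; then
    echo "less than 10 GB free on /tmp - cleanup would start"
else
    echo "enough free space on /tmp - no cleanup needed"
fi
```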

Remote File Access: bsub -f "lfile op [rfile]"

If a file is not available on the execution host (e.g. due to NFS problems), it can be copied from the submission host to the execution host:

user@lxi007:/u/user> bsub -m linux1 -f "/tmp/mb >" myjob

copies the file /tmp/mb from the submission host lxi007 to linux1 prior to execution of the program myjob.
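
The operator op can also copy results back: > copies from the submission host to the execution host before the job starts, < copies back after it completes. A guarded sketch with made-up file names:

```shell
# Copy input.dat to the execution host before the job, and fetch
# result.dat back to the submission host afterwards (hypothetical files).
if command -v bsub >/dev/null 2>&1; then
    bsub -m linux1 -f "/tmp/input.dat >" -f "/tmp/result.dat <" myjob
else
    echo "would copy /tmp/input.dat out and /tmp/result.dat back"
fi
```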

Parallel Jobs: bsub -n min_proc,max_proc

LSF tries to provide max_proc CPUs for the calling job. The job is started only if min_proc CPUs can be allocated. After the start of the job no further CPUs are added.

Using the job slot reservation feature, CPUs can be reserved for a defined amount of time. This increases the probability of getting the desired number of CPUs.

bsub or lsrun: Run an Interactive Remote Job

LSF provides direct interaction with the execution host via command line or GUI. Even key combinations such as Ctrl-C behave as in local applications. Accessing the Interactive Job Facility via bsub adds the Job Pending Time to the waiting time.

Interactive Jobs: bsub -I

bsub waits for the command to complete:

bsub -I -m linux3 dir /tmp/scratch

Output goes to the terminal.

The command lsrun is more efficient, enabling immediate execution.

Interactive Jobs: lsrun

lsrun -m linux3 dir /tmp/scratch

Interactive remote execution is also possible via a pseudo terminal:

lsrun -m linux3 -P vi /tmp/m1

or

bsub -Ip -m linux3 vi /tmp/m1

lsgrun: Run interactive parallel Jobs

The given job is run interactively on the specified hosts:

lsgrun -m "linux1 linux2 linux3" cat /tmp/m1 >> /tmp/m1

The files /tmp/m1 on the hosts linux1 to linux3 are concatenated.

To delete these files, one can use lsgrun with the option -p (parallel):

lsgrun -m "linux1 linux2 linux3" -p rm /tmp/m1


German original of this documentation by:

-- Bärbel Lasitschka - Jun 1999 -- ChristopherHuhn - 06 Apr 2004


-- ThomasRoth - 16 Oct 2006