-- CarstenPreuss - 16 May 2011

How to use SGE in a non-NFS environment?

SGE is developed to work in an NFS environment; at least SGE itself depends on a shared file system.

It is possible (and tested) to run standalone worker nodes just by copying the /SGE directory.

The load sensors, the prolog/epilog scripts and the accounting file (needed for qacct; qstat works without it) must be copied to the worker nodes.
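The copy step can be sketched as below. All paths are assumptions for illustration (your SGE_ROOT layout will differ), and the demo uses two local directories as stand-ins for the head node and a worker node; on a real cluster you would rsync the tree over ssh to each node.

```shell
# Sketch: replicate the SGE tree to a worker node that has no NFS mount.
# HEAD/WN are stand-in paths, not a real site layout.
HEAD=/tmp/sge_demo/head/opt/sge
WN=/tmp/sge_demo/wn/opt/sge
rm -rf /tmp/sge_demo

# Simulate the pieces the text says must be present on the worker node:
mkdir -p "$HEAD/util/sensors" "$HEAD/scripts" "$HEAD/default/common"
touch "$HEAD/util/sensors/load_sensor.sh"              # load sensors
touch "$HEAD/scripts/prolog.sh" "$HEAD/scripts/epilog.sh"
touch "$HEAD/default/common/accounting"                # accounting file (for qacct)

# Copy the whole tree; on a real cluster: rsync -a "$HEAD/" node:"$HEAD/"
mkdir -p "$(dirname "$WN")"
cp -a "$HEAD" "$WN"

ls "$WN/default/common/accounting"
```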

How to submit/account jobs with different priorities?

It is possible to give jobs a priority and to give that priority a weight. This lets users separate important jobs from normal ones.

If a user has to get something done just in time but has no shares left, he is in trouble.

He can ask for so-called override tickets (these can be spread over jobs and are really used up). But they are not taken into account separately for fair share (FS). It would be nice if users could give their jobs priority over other jobs, provided this were accounted separately.

Let's say priorities are defined as a resource: HP, SP(0), LP.
HP is accounted 1.5 to 2.0 times higher on FS than normal (SP) jobs.
LP is accounted 0.5 down to 0.0 times lower on FS than normal jobs.

Without this accounting it would make no sense for users to use any priority other than HP, and in that case (everybody using HP) it makes no sense to implement it at all.

It is necessary to define priorities as an accounted resource...
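A hedged sketch of what declaring such a resource could look like: a requestable `prio` complex that jobs are tagged with at submission. The name, shortcut and default are assumptions, and note that stock SGE will NOT scale fair-share usage by such a resource; the weighting itself would need scheduler-side support.

```shell
# Hypothetical only: declare a requestable "prio" complex so jobs can be
# tagged HP/SP/LP at submission. Stock SGE does not weight FS usage by it.
qconf -sc > complexes.txt
cat >> complexes.txt <<'EOF'
#name  shortcut  type    relop  requestable  consumable  default  urgency
prio   pr        STRING  ==     YES          NO          SP       0
EOF
qconf -Mc complexes.txt

# Users would then tag their jobs:
qsub -l prio=HP myjob.sh    # to be charged 1.5-2.0x on fair share
qsub -l prio=LP myjob.sh    # to be charged 0.0-0.5x on fair share
```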

Accounting

Some Alice users still belong to kp1 and (even worse, at least for Hades) to Hades via their GID; SGE uses the GID for accounting.

Time gap

There is a time gap after a job finishes in which the job appears neither in qstat nor in qacct:

cpreuss@lxir001:~ $ qstat -j 37238
Following jobs do not exist: 37238
cpreuss@lxir001:~ $ qacct -j 37238
error: job id 37238 not found

This can confuse automated monitoring scripts.
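One way a monitoring script can tolerate the gap is to retry qacct a few times before declaring the job lost. The retry count and delay below are assumptions for illustration, and the demo substitutes a stub for qacct (failing twice, then succeeding) so the logic can be shown without a running SGE cell.

```shell
# Sketch: poll accounting until the record appears, up to $2 tries.
# $ACCT_CMD stands in for qacct; on a real cluster set ACCT_CMD=qacct.
wait_for_acct() {
    job_id=$1; tries=${2:-5}; delay=${3:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$ACCT_CMD" -j "$job_id" >/dev/null 2>&1; then
            return 0            # accounting record has appeared
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1                    # still missing after all retries
}

# Demo stub in place of qacct: fails on the first two calls, then succeeds,
# mimicking a job that shows up in accounting only after a delay.
rm -f /tmp/acct_stub_state
cat > /tmp/acct_stub.sh <<'EOF'
#!/bin/sh
n=$(cat /tmp/acct_stub_state 2>/dev/null || echo 0)
echo $((n + 1)) > /tmp/acct_stub_state
[ "$n" -ge 2 ]
EOF
chmod +x /tmp/acct_stub.sh
ACCT_CMD=/tmp/acct_stub.sh

wait_for_acct 37238 5 0 && echo "job 37238 accounted"
```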

Accessing queues/projects

As we don't want users to define/use queues, the -w e parameter on qsub together with setting the complex qname to requestable = NO refuses jobs that request a queue.
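The setup described above, sketched with real SGE commands (qname is the built-in queue-name complex; the queue and script names are placeholders):

```shell
# Make the built-in qname complex non-requestable, so explicit queue
# requests are rejected (qconf -mc opens the complex list in an editor):
qconf -mc            # set the qname line: requestable NO

# With "-w e" (error on unverifiable requests) such a submission is refused:
qsub -w e -q some.q job.sh
```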

Resource reservations are not accounted

If a user reserves resources, e.g. CPUs, these CPUs will be idle until the reservation takes place. During this time the CPUs can be used only by backfill jobs, but that is not guaranteed. This idle CPU time is not accounted, but it should be.

It is possible to get reservations accounted; at least this is claimed on some mailing list....

SHARETREE_RESERVED_USAGE=true
ACCT_RESERVED_USAGE=true

SGE Bugs and Problems

Limits

-l h_rss does not force a kill of the job if the limit is exceeded, neither at job nor at queue level.

-l h_rt also counts suspended time; suspended time is not accounted separately (as it is in LSF).

-l h_vmem is working.
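The three limits above in submission form (script name is a placeholder; the behaviour noted is what these notes observed, not what the manual promises):

```shell
qsub -l h_vmem=2G job.sh   # enforced: job is killed when it exceeds 2 GB vmem
qsub -l h_rss=2G  job.sh   # NOT enforced here: exceeding 2 GB RSS does not kill
qsub -l h_rt=3600 job.sh   # hard runtime limit; suspended time counts, too
```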

load/suspend thresholds

Overbooking will not work if all (or at least the majority of) jobs use more than the average of one resource at the same time.

This was experienced with real memory. The idea was to allow more than 1 core per job, or more than the available 2 GB per core for a job.

The suspend threshold mem_free was set to 2 GB, with suspend interval = 1 min. This led to a situation where more or less the complete memory was used up (memory allocation is still faster than job suspension) and in the end all jobs were suspended. Without running jobs no job can finish, so no memory could be freed: the node was stuck!

The other way around, with NO mem_free threshold, all jobs would run and then swap; that is no better.

The needed solution would be to keep ONE job running (which would force some suspended jobs to be moved to swap) and to keep enough memory free for that running job.

Job suspending

qhold / qalter -h / qmod -sj: qhold works only for waiting jobs, but the h(old) flag can also be set on running jobs (without consequences).

The qhold command should be dropped, and qalter/qmod should be used for jobs in all states.
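For reference, the hold/suspend commands involved (job id 37238 reused from the example above as a placeholder):

```shell
qsub -h job.sh      # submit the job already on hold
qalter -h u 37238   # set a user hold (effective for pending jobs)
qalter -h U 37238   # release the user hold
qmod -sj 37238      # suspend a RUNNING job
qmod -usj 37238     # unsuspend it
```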

Memory usage monitoring

Memory usage above 4 GB is wrapped around, so 5 GB is shown as 1 GB.

Limits above 4 GB are enforced, though.
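The "5 GB shown as 1 GB" behaviour is consistent with a 32-bit counter overflow: the reported value is presumably the true usage modulo 2^32 bytes (4 GiB). This is an interpretation of the observed symptom, not a confirmed statement about SGE's internals:

```shell
# 32-bit wraparound: reported value = true usage mod 2^32 bytes (4 GiB)
true_usage=$((5 * 1024 * 1024 * 1024))   # 5 GiB really in use
reported=$((true_usage % 4294967296))    # what a 32-bit counter would show
echo "$reported"                         # 1073741824 bytes = 1 GiB
```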

Done jobs

The manual talks about qstat -z for zombie jobs; this is misleading, as there are no zombie/unknown jobs in SGE. What is meant are finished jobs.

If qstat -z is used, the jobs are shown in state qw; it should be d or something similar.

Mismatch between GUI and CLI

There are some mismatches between the job states shown in the GUI and in the CLI.

Job state zombie/unknown needed?

Jobs which are residing on unavailable hosts should be marked.

They are, but only in the GUI.

Hosts in unknown state

Hosts in unknown state should be marked; so far this state is only noticeable through missing host variables.

Dependency jobs

Dependency jobs have to wait for all predecessor jobs to be done (no matter in which state). It would be nice to be able to: a) define an exit status, b) define the number of jobs that must be done, c) define whether or not to wait for jobs in unknown state.
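The dependency mechanism as it stands today, for comparison (script names are placeholders): -hold_jid waits for ALL listed jobs to leave the system, regardless of exit status or final state, which is exactly the limitation described above.

```shell
# stage2 starts only once stage1 has left the system, whatever its outcome.
first=$(qsub -terse stage1.sh)   # -terse prints just the job id
qsub -hold_jid "$first" stage2.sh
```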

Show complex resources

Using qstat -l tmp_free=1T or qhost -l tmp_free=1T fails with:

cpreuss@lxir001:~ $ qhost -l lustre_free=1T
error: attribute "lustre_free" is not a memory value

Don't get the message wrong: tmp_free IS a memory variable, SGE just does not know the 'T' multiplier! Use G instead.
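So the same request spelled in a unit SGE understands:

```shell
# 1 T = 1024 G; express the threshold in gigabytes instead:
qhost -l lustre_free=1024G
qstat -l tmp_free=1024G
```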
Topic revision: r10 - 2011-11-07, BastianNeuburger