- 16 May 2011
How to use SGE in a non-NFS environment?
SGE is designed to work in an NFS
environment; at least SGE itself depends on a shared filesystem.
It is possible, and tested, to run standalone worker nodes just by copying the /SGE directory.
The load sensors, the prolog/epilog scripts and the accounting file (needed for qacct; qstat works without it) must be copied to the WNs.
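A sketch of that copy step under the assumptions above (the /SGE root, a placeholder worker node name wn01, and the default cell name):

```shell
# Sync the SGE installation tree to a standalone worker node (no NFS).
# wn01 is a placeholder hostname; adjust the /SGE root to the local setup.
rsync -a /SGE/ wn01:/SGE/

# Keep the accounting file current on the node so qacct works there, too.
rsync -a /SGE/default/common/accounting wn01:/SGE/default/common/accounting
```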
How to submit/account jobs with different priorities?
It is possible to give jobs a priority and to give this priority a weight.
This lets users separate important jobs from normal ones.
But if a user has something to do just in time and has no shares left, he is in trouble.
He can ask for so-called override tickets (these can be spread over the jobs and are really used up).
These mechanisms, however, are not taken into account separately for fair share (FS). It would be nice if users could give their jobs priority over other jobs, provided this were accounted for separately.
Let's say priorities are defined as a resource: HP, SP (0), LP.
HP jobs are charged 1.5 to 2.0 times more FS usage than normal (SP) jobs.
LP jobs are charged 0.5 to 0.0 times the FS usage of normal jobs.
Without such accounting it would make no sense for users to request any priority other than HP, and in that case (everybody using HP) implementing it makes no sense at all.
So priorities need to be defined as an accounted resource...
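One hypothetical way to model this: define the priority class as a requestable complex. The name prio, its values and the FS weighting step are assumptions; stock SGE does not weight FS usage by such a resource on its own:

```shell
# Excerpt of a complex definition as shown/edited via qconf -sc / -mc.
# name  shortcut  type  relop  requestable  consumable  default  urgency
# prio  pr        INT   <=     YES          NO          0        0

# Users would then tag their jobs, e.g.:
qsub -l prio=2 job.sh    # HP job, to be charged 1.5-2.0x in FS accounting
```

An external accounting step would then have to read the requested prio value per job from the accounting file and weight that job's FS usage accordingly.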
Some Alice users still belong to kp1 and (even worse, at least for Hades) to Hades via their GID.
The GID is used by SGE.
There is a time gap after a job finishes during which the job shows up neither in qstat nor in qacct:
cpreuss@lxir001:~ $ qstat -j 37238
Following jobs do not exist:
cpreuss@lxir001:~ $ qacct -j 37238
error: job id 37238 not found
This can confuse automated monitoring scripts.
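A monitoring script therefore has to treat "unknown to both qstat and qacct" as a transient state rather than an error; a minimal sketch (job id and retry values are placeholders):

```shell
# Poll qacct until the finished job appears in the accounting file.
jobid=37238
for try in 1 2 3 4 5; do
    if qacct -j "$jobid" >/dev/null 2>&1; then
        echo "job $jobid is accounted"
        break
    fi
    sleep 60   # the job may still be in the qstat/qacct gap
done
```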
As we don't want users to define/use queues, the -w e parameter set in the qsub environment, combined with setting the complex qname to requestable = NO, rejects jobs that request a queue.
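As a sketch (the exact complex line may differ per SGE version):

```shell
# In the complex configuration (qconf -mc), qname is set to requestable NO:
# name   shortcut  type      relop  requestable  consumable  default  urgency
# qname  q         RESTRING  ==     NO           NO          NONE     0

# With submit verification enabled, a queue request is then refused:
qsub -w e -q some.q job.sh   # rejected: qname is not requestable
```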
Resource reservations are not accounted
If a user reserves resources, e.g. CPUs, those CPUs stay idle until the reservation takes place. During this time the CPUs can only be used by backfill jobs, which is not guaranteed.
This idle time is not accounted, but it should be.
It is supposedly possible to get reservations accounted; at least that is claimed on some mailing list...
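For reference, a reservation is requested per job with -R y and has to be enabled in the scheduler configuration; the PE name and job below are placeholders:

```shell
# The scheduler configuration (qconf -msconf) must allow reservations:
#   max_reservation   32

# Request a reservation so a large parallel job is not starved:
qsub -R y -pe mpi 64 job.sh
```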
SGE Bugs and Problems
-l h_rss does not force a kill of the job when the limit is exceeded, neither on job nor on queue level.
-l h_rt also counts suspended time; suspended time is not accounted separately (as it is in LSF).
-l h_vmem works.
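The three limits as they are requested at submit time (values are examples):

```shell
qsub -l h_rss=2G      job.sh   # exceeding it does NOT kill the job (see above)
qsub -l h_rt=01:00:00 job.sh   # run time limit; suspended time counts as well
qsub -l h_vmem=2G     job.sh   # exceeding it DOES kill the job
```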
Overbooking does not work when all jobs (or at least the majority) use more than the average of one resource at the same time.
We experienced this with real memory. The idea was to allow more than 1 core per job, or more than the available 2 GB per core per job.
The suspend threshold mem_free was set to 2 GB, with a suspend interval of 1 min.
This led to a situation where more or less the complete memory was used up (memory allocation was still faster than job suspension) and in the end
all jobs were suspended. With no running jobs, no job can finish, so no memory is ever freed - the node was stuck!
The other way around, with NO mem_free threshold, all jobs keep running and then swap, which is no better.
The needed solution would be to keep ONE job running (this would force some suspended jobs out to swap) while keeping enough memory free for that running job.
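The setup that produced the deadlock, as a queue-configuration sketch (fields as in qconf -mq; values as described above):

```shell
# Queue configuration excerpt used in the experiment:
#   suspend_thresholds   mem_free=2G   # suspend when free memory drops below 2 GB
#   nsuspend             1             # jobs suspended per interval
#   suspend_interval     00:01:00      # evaluate once per minute
```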
qhold/qalter -h/qmod -sj
qhold works only on waiting jobs, but the h(old) flag can also be set on running jobs (without consequences).
The qhold command should be dropped, and qalter/qmod should be used for jobs in all states.
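The state-independent equivalents (job id is a placeholder):

```shell
qalter -h u 37238   # set a user hold, also allowed on a running job
qrls 37238          # release the hold
qmod -sj 37238      # suspend a running job
qmod -usj 37238     # unsuspend it again
```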
Memory usage monitoring
Memory usage above 4 GB wraps around, so 5 GB is shown as 1 GB.
Limits above 4 GB are, however, enforced.
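The symptom matches a 32-bit byte counter; a quick arithmetic sketch of the wraparound:

```shell
# A 32-bit counter wraps at 4 GiB, so 5 GiB of usage is reported as 1 GiB.
GIB=1073741824                   # 1 GiB in bytes
used=$((5 * GIB))                # real usage: 5 GiB
reported=$((used % (4 * GIB)))   # what the wrapped counter shows
echo $((reported / GIB))         # prints 1
```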
The manual talks about qstat -z for zombie jobs; this is misleading, as there are no zombie/unknown jobs in SGE.
What is meant are finished jobs.
When qstat -z is used, the jobs are shown in state qw, but should be d or similar.
Mismatch between GUI and CLI
There are some mismatches in the job states shown by the GUI and the CLI.
Job state zombie/unknown needed?
Jobs residing on unavailable hosts should be marked.
They are, but only in the GUI.
Hosts in unknown state
Hosts in unknown state should be marked; so far this state is only noticeable through missing host variables.
Dependency jobs have to wait for all predecessor jobs to be done (no matter in which state).
It would be nice to:
a) define an exit status to wait for
b) define how many of the jobs must be done
c) define whether or not to wait for jobs in unknown state
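Today the only knob is -hold_jid, which waits for all listed jobs to leave the system regardless of how they ended (job ids are placeholders):

```shell
# jobB starts only after jobs 100 and 101 are gone, whether they
# succeeded, failed, or sat on a host that became unavailable.
qsub -hold_jid 100,101 jobB.sh
```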
Show complex resources
Using qstat -l tmp_free=1T or qhost -l tmp_free=1T fails with:
cpreuss@lxir001:~ $ qhost -l lustre_free=1T
error: attribute "lustre_free" is not a memory value
Don't get the message wrong, tmp_free IS a memory variable; SGE simply does not know the unit 'T'! Use G instead.
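The same request with the amount spelled in gigabytes is accepted:

```shell
# 1T is unknown to SGE; 1024G expresses the same amount:
qhost -l lustre_free=1024G
```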