PipelineFX Forum
Qube! => Installation and Configuration => Topic started by: michael.graf on November 25, 2008, 12:23:54 PM
-
We have had this problem for a while, initially Case 13105.
Currently we have:
supervisor booting - version: 5.3-0 build: bld-5-3-2008-03-28-3.
supervisor_queue_binding: Internal
supervisor_queue_library: queue
As our job list grows, different users submit jobs to the same cluster with the same priority (/, 1000) and the same resource requirements. But instead of the lowest job id starting first, the supervisor starts MUCH later jobs first, in batches that seem to be grouped by username.
Needless to say designers are starting to get rather upset now.
I even copied the example algorithm and changed the configuration to
supervisor_queue_binding: Perl
supervisor_queue_library: E:\pfx\qube\etc\VSalgorithm.pm
sub qb_init
{
    print <<INIT;
#########################################################################################
Copyright: PipelineFX L.L.C.
           All Rights Reserved.
Software:  Qube!
Purpose:   supervisor queuing algorithm replacement perl module.
           This is an example module to be used for reference in
           building a custom queuing algorithm.
           Qube! license holders may modify this module for their own private use.
#########################################################################################
INIT
}

sub qb_jobcmp
{
    my $joba = shift;
    my $jobb = shift;
    my $host = shift;

    #
    # Sort by priority and then by job id
    #
    $jobb->{priority} <=> $joba->{priority}
        or
    $jobb->{id} <=> $joba->{id};
}

sub qb_hostcmp
{
    my $hosta = shift;
    my $hostb = shift;
    my $job   = shift;

    return 0;
}

sub qb_jobreject
{
    my $job  = shift;
    my $host = shift;

    #
    # return 0 if nothing is wrong.
    #
    return 0;
}

1;  # a .pm file must return a true value when loaded
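To sanity-check the tie-break logic, the comparator can be exercised outside the supervisor. The sketch below (job hashes and ids are made up; how the supervisor maps the sign of the result onto dispatch order is not documented in this thread) shows that on a priority tie, the job with the lower id produces a positive result when passed first:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same tie-break logic as qb_jobcmp in the module above:
# priority first, then job id.
sub jobcmp {
    my ($joba, $jobb) = @_;
    return $jobb->{priority} <=> $joba->{priority}
        || $jobb->{id} <=> $joba->{id};
}

# Two hypothetical jobs with equal priority; the ids are invented.
my $older = { id => 101, priority => 1000 };
my $newer = { id => 103, priority => 1000 };

# On a priority tie, the older (lower-id) job compares as "greater".
printf "older vs newer: %d\n", jobcmp($older, $newer);   # prints 1
printf "newer vs older: %d\n", jobcmp($newer, $older);   # prints -1
```

If a positive return means the first argument is preferred, this comparator does give FIFO ordering within a priority level, which matches the behavior being requested in this thread.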
If I am reading it correctly, this implements priority-based queuing with job id as the tie-breaker.
I submitted about 30 simple jobs with qbsub -hosts "yors0354, yors0355". Even these ran out of order under the Perl queue binding, with two hosts and multiple CPUs available per host.
Am I reading this correctly, that Qube only compares two jobs at a time? So depending on where the supervisor is in the pending job list when resources become available, jobs just start running even if they technically should not be next?
Is there a way for Qube to sort the global pending job list?
Also, while testing this, the log showed
ERROR: caught exception in child.
and none of the pending test jobs would start running. I needed to restart the supervisor!
-
I've noticed the same issue on occasion. It infuriates some of the folks using the farm, but I have not had the chance to chase it down. This was originally on a 5.2.2 supervisor, but it seems to be the case with 5.4.0 as well. If the farm frees up I'll see if I can run a similar test.
-
Hi-
Jobs with the same priority not starting strictly in order of submission
is largely an artifact of the event-driven nature of the Qube architecture.
Make sure that you have "retry_busy" set in the "supervisor_flags"
parameter in qb.conf (it's set by default). Without it, the supervisor
takes further shortcuts for efficiency's sake.
You may also try experimenting with adding "schedule" to the "supervisor_flags"
parameter in the supervisor's qb.conf. This will make the supervisor
use an iterative search to assign jobs to hosts. The drawback is that
it loses the efficiency of the event-driven model, and may degrade the
performance of the supervisor significantly.
-
ahh, undocumented features!
Our current setting...
supervisor_flags = host_recontact,heartbeat_monitor,running_monitor
is there an explanation for retry_busy?
Also, more info about "schedule" would be nice to have.
Thanks!!
-
Unsetting retry_busy tells the supervisor, during job dispatch, to
move on to the next worker if the current candidate worker is busy
with something (which usually means it's busy starting up another subjob).
Normally, when retry_busy is set, the supe waits until the current
candidate worker reports whether the last dispatch was accepted or not.
Unsetting retry_busy also short-circuits some of the match-making
(between jobs and workers) decision logic to speed up dispatching;
however, that comes with the penalty of compromising job dispatch
order, including the "host order" feature.
The "schedule" flag enables an extra thread that runs a more
traditional iterative scheduling loop, periodically sweeping over
available hosts to find suitable jobs to run. It also disables
most of the default event-driven logic that makes Qube very
efficient.
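Putting the two suggestions together with the flags already quoted earlier in the thread, the resulting qb.conf entry would look something like the following (a sketch only; keep whatever flags your site already relies on, and note that "schedule" carries the performance cost described above):

```
supervisor_flags = host_recontact,heartbeat_monitor,running_monitor,retry_busy,schedule
```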
-
Thanks for the additional information.
We found that adding the "retry_busy" did not correct our FIFO problem. Jobs still run out of order, even when they have all the same requirements, priority, etc.
We will add the "schedule" flag to see if it changes things.
Michael
-
Hi Michael, Jesse,
We've reproduced the issue, and indeed sometimes jobs can get
dispatched grossly out of order. We'll need to have a look at the
core code, and get back to you folks.
Thanks for your patience!
-
Thanks for the update. We will continue to wait patiently.
-
I wanted to bump this thread up, since we are seeing the same issue. We want to run a simple FIFO setup as well, and see the same problem with seemingly random ordering instead of ordering by job id.
Any update to this?
-
Has there been any progress on this bug/problem?
-
We're running Qube 5.5.0 and are frequently seeing jobs scheduled out of order.
Are there any workarounds or additional configuration that anybody is able to share?
Steve
-
I'd be very interested in an official response to this issue.
I still have problems with the queue order on version 6.3.3.