Topic: FIFO not working correctly, even with custom algorithm

michael.graf

  • Sr. Member
  • ****
  • Posts: 26
FIFO not working correctly, even with custom algorithm
« on: November 25, 2008, 12:23:54 PM »
We have had this problem for a while; it was initially logged as Case 13105.

Currently we have:
supervisor booting - version: 5.3-0 build: bld-5-3-2008-03-28-3.
supervisor_queue_binding: Internal
supervisor_queue_library: queue

As our job list grows, different users submit jobs with the same cluster and priority (/, 1000) and the same resource requirements.  But instead of the lowest job id going first, Qube seems to start MUCH later jobs first, in groups that appear to be based on username.
Needless to say, designers are starting to get rather upset.

I even copied the example algorithm and changed the configuration to
supervisor_queue_binding: Perl
supervisor_queue_library: E:\pfx\qube\etc\VSalgorithm.pm

Code:

sub qb_init
{
    print <<INIT;
#########################################################################################

Copyright: PipelineFX L.L.C.
All Rights Reserved.

Software: Qube!

Purpose: supervisor queuing algorithm replacement perl module.

This is an example module to be used for reference in
building a custom queuing algorithm.

Qube! license holders may modify this module for their own private use.

#########################################################################################
INIT
}

sub qb_jobcmp
{
    my $joba = shift;
    my $jobb = shift;
    my $host = shift;

    #
    # Sort by priority and then by job id
    #

    $jobb->{priority} <=> $joba->{priority}
        or
    $jobb->{id} <=> $joba->{id};
}

sub qb_hostcmp
{
    my $hosta = shift;
    my $hostb = shift;
    my $job = shift;

    return 0;
}

sub qb_jobreject
{
    my $job = shift;
    my $host = shift;

    #
    #  return 0 if nothing is wrong.
    #
    return 0;
}


I am assuming that this gives priority-then-job-id based queuing.
I submitted about 30 simple jobs (qbsub -hosts "yors0354, yors0355" set).  Even these ran out of order using the Perl queue binding when allowed on 2 hosts with multiple CPUs available per host.

Am I reading this correctly that Qube only compares two jobs at a time?  And I assume that, depending on where it is in the pending job list when resources become available, jobs just start running even if they should technically not be next?

Is there a way for Qube to sort the global pending job list?
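
For comparison, here is what a globally sorted pending list looks like in plain Perl. This is a minimal standalone sketch, not Qube code: the job records, `fifo_cmp` and `sort_pending` are hypothetical names, and it assumes Perl's own `sort` convention, where a negative result places the first argument earlier. Under that convention the id term has to be written `$x->{id} <=> $y->{id}` to put the lowest id first. Whether `qb_jobcmp` follows the same sign convention is not stated in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical pending jobs; only the fields used by the comparator matter.
my @pending = (
    { id => 105, priority => 1000, user => 'alice' },
    { id => 101, priority => 1000, user => 'bob'   },
    { id => 103, priority =>  500, user => 'alice' },
    { id => 102, priority => 1000, user => 'carol' },
);

# Comparator using Perl's sort convention (negative: $x sorts before $y).
# Larger priority value first, as in the module above, then lowest id first.
# ASSUMPTION: this sign convention may differ from what Qube expects.
sub fifo_cmp {
    my ($x, $y) = @_;
    return $y->{priority} <=> $x->{priority}
        || $x->{id} <=> $y->{id};
}

# Sort the whole pending list at once instead of comparing jobs pairwise
# as resources free up.
sub sort_pending {
    my @jobs = @_;
    return sort { fifo_cmp($a, $b) } @jobs;
}

my @ordered = sort_pending(@pending);
print join(' ', map { $_->{id} } @ordered), "\n";   # prints "101 102 105 103"
```

Running the same four records through a comparator with the id operands swapped would list 105 first, which makes for a quick check of which direction a given comparator actually sorts.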

Also while testing this the log has
ERROR: caught exception in child.
and all the pending test jobs would not start running.  I needed to restart the supervisor!

jesse

  • Sr. Member
  • ****
  • Posts: 40
Re: FIFO not working correctly, even with custom algorithm
« Reply #1 on: November 26, 2008, 01:04:56 AM »
I've noticed the same issue on occasion.  It infuriates some of the folks using the farm, but I have not had the chance to chase it down.  This was originally on a 5.2.2 supervisor, but it seems to be the case with 5.4.0 as well.  If the farm frees up I'll see if I can run a similar test.

shinya

  • Administrator
  • *****
  • Posts: 232
Re: FIFO not working correctly, even with custom algorithm
« Reply #2 on: November 27, 2008, 07:05:22 AM »
Hi-

Jobs with the same priority not starting strictly in order of submission
is largely an artifact of the event-driven nature of the Qube architecture.

Make sure that you have "retry_busy" set in the "supervisor_flags"
parameter in qb.conf (it's set by default).  Without it, the supervisor
takes further shortcuts for efficiency's sake.

You may also try adding "schedule" to the "supervisor_flags"
parameter in the supervisor's qb.conf.  This will make the supervisor
use an iterative search to assign jobs to hosts.  The drawback is that
it will lose the efficiency of the event-driven design, and may degrade
the performance of the supervisor significantly.




michael.graf

  • Sr. Member
  • ****
  • Posts: 26
Re: FIFO not working correctly, even with custom algorithm
« Reply #3 on: November 27, 2008, 01:37:34 PM »
ahh, undocumented features!

Our current setting...

supervisor_flags = host_recontact,heartbeat_monitor,running_monitor

Is there an explanation for retry_busy?
Also, more info about "schedule" would be nice to have.

Thanks!!

shinya

  • Administrator
  • *****
  • Posts: 232
Re: FIFO not working correctly, even with custom algorithm
« Reply #4 on: December 03, 2008, 05:54:57 AM »
Unsetting retry_busy tells the supervisor, during job dispatch, to
move over to the next worker if the current candidate worker is busy
with something (which usually means it's busy starting up another subjob).
Normally, when retry_busy is set, the supervisor waits until the current
candidate worker tells it whether the last dispatch was accepted or not.

Unsetting retry_busy will also short-circuit some match-making
(between jobs and workers) decision logic, to speed up dispatching;
however, that comes with the penalty of compromising job dispatch
order, including the "host order" feature.
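
As an illustration only, taking the flag list posted earlier in this thread and adding retry_busy in the same comma-separated form would give a qb.conf line like the one below. The other three flags are simply the ones from that post, and whether this exact combination suits a given site is an assumption.

```
supervisor_flags = host_recontact,heartbeat_monitor,running_monitor,retry_busy
```

The "schedule" flag would presumably be appended to the same list in the same way.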

The "schedule" flag will enable an extra thread that runs a more
traditional iterative scheduling logic that periodically loops over
available hosts to find suitable jobs to be run.  It also disables
most of the default event-driven logic which makes Qube very
efficient.
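
To make the idea of an iterative pass concrete, here is a toy Perl sketch of a single scheduling loop over hosts. It is not Qube's implementation: `schedule_pass`, the slot counts and the job ids are all hypothetical, and the only point is that a loop which always takes the head of an ordered pending list hands out jobs strictly in that order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy model: free slots per host and a pending list that is already
# sorted in the desired dispatch order (oldest job id first).
my %free_slots = ( yors0354 => 2, yors0355 => 1 );
my @pending    = ( 101, 102, 103, 104, 105 );

# One scheduling pass: walk the hosts in name order and give every free
# slot the job at the head of the pending list.
sub schedule_pass {
    my ($slots, $queue) = @_;
    my @dispatched;
    for my $host (sort keys %$slots) {
        while ($slots->{$host} > 0 && @$queue) {
            my $job = shift @$queue;      # always the oldest pending job
            $slots->{$host}--;
            push @dispatched, "$job\@$host";
        }
    }
    return @dispatched;
}

my @started = schedule_pass(\%free_slots, \@pending);
print "$_\n" for @started;
print "still pending: @pending\n";
```

With the values above, jobs 101 and 102 go to yors0354, 103 goes to yors0355, and 104 and 105 stay pending until a later pass finds a free slot.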



michael.graf

  • Sr. Member
  • ****
  • Posts: 26
Re: FIFO not working correctly, even with custom algorithm
« Reply #5 on: December 16, 2008, 09:27:22 AM »
Thanks for the additional information.

We found that adding "retry_busy" did not correct our FIFO problem.  Jobs still run out of order, even when they all have the same requirements, priority, etc.

We will add the "schedule" setting to see if it changes things.


Michael

shinya

  • Administrator
  • *****
  • Posts: 232
Re: FIFO not working correctly, even with custom algorithm
« Reply #6 on: April 04, 2009, 01:31:31 AM »
Hi Michael, Jesse,

We've reproduced the issue, and indeed sometimes jobs can get
dispatched grossly out of order.  We'll need to have a look at the
core code, and get back to you folks.

Thanks for your patience!

michael.graf

  • Sr. Member
  • ****
  • Posts: 26
Re: FIFO not working correctly, even with custom algorithm
« Reply #7 on: April 06, 2009, 11:53:16 AM »
Thanks for the update.  We will continue to wait patiently.

justin

  • Jr. Member
  • **
  • Posts: 6
Re: FIFO not working correctly, even with custom algorithm
« Reply #8 on: July 14, 2009, 07:01:44 PM »
I wanted to bump this thread up, since we are seeing the same issue. We want to run a simple FIFO setup as well and see the same problem: jobs start in random order instead of by ID.
Any update on this?

michael.graf

  • Sr. Member
  • ****
  • Posts: 26
Re: FIFO not working correctly, even with custom algorithm
« Reply #9 on: September 07, 2009, 11:39:40 AM »
Has there been any progress on this bug/problem?

stevespo

  • Jr. Member
  • **
  • Posts: 4
Re: FIFO not working correctly, even with custom algorithm
« Reply #10 on: January 05, 2010, 08:59:55 PM »
We're running Qube 5.5.0 and are frequently seeing jobs scheduled out of order. 
Are there any workarounds or additional configuration that anybody is able to share?

Steve

jesse

  • Sr. Member
  • ****
  • Posts: 40
Re: FIFO not working correctly, even with custom algorithm
« Reply #11 on: May 30, 2012, 02:14:40 AM »
I'd be very interested in an official response to this issue.

I still have problems with the queue order on version 6.3.3.