Topic: Qube! 6.3.6 Core/Supervisor/Worker maintenance release is available

shinya

  • Administrator
A new 6.3.6 maintenance release is now available for the Qube 6.3 Core/Supervisor/Worker. This is a recommended release for all customers running Qube v6.3.

=======================================================
    Highlights
=======================================================
* many fixes for out-of-order dispatch issues


Below is a detailed list of the fixes and enhancements since the last point-release.

===========================================
Core / Supervisor / Worker changes
===========================================

##############################################################################
@RELEASE: 6.3.6

==== CL 10462 ====
@FIX: yet yet another fix for out-of-order dispatch behavior: eliminated a race condition that allowed lower-priority jobs that had just been preempted to get workers before higher-priority jobs.
See also CL 10440, 10452

ZD: 8198
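
To illustrate the ordering this fix is meant to guarantee, here is a minimal sketch. It is a toy model and not Qube's actual dispatcher: the `Dispatcher` class and its method names are hypothetical, and it assumes that a lower priority number means a more important job. The point it shows is that a just-preempted job goes back into the same queue as everyone else, under the same lock, so it cannot reclaim a worker ahead of a higher-priority job.

```python
import heapq
import itertools
import threading


class Dispatcher:
    """Toy model (hypothetical): one lock guards both requeueing and worker hand-out."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiting = []             # heap of (priority, sequence, job name)
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, name, priority):
        with self._lock:
            heapq.heappush(self._waiting, (priority, next(self._seq), name))

    def preempt(self, name, priority):
        # A preempted job re-enters the same queue; it never keeps or
        # directly reclaims the worker it was just removed from.
        self.submit(name, priority)

    def next_job(self):
        """Return the job that should get the next free worker, or None."""
        with self._lock:
            if not self._waiting:
                return None
            return heapq.heappop(self._waiting)[2]
```

With one high-priority job waiting and one low-priority job freshly preempted, `next_job()` hands the free worker to the high-priority job first.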

==== CL 10461 ====
@CHANGE: modified/compacted the multi-line "found a duty to replace" logging to be a single line.

==== CL 10452 ====
@FIX: yet another fix for out-of-order dispatch behavior: eliminated a race condition that allowed lower-priority jobs that had just been preempted to get workers before higher-priority jobs.
See also CL 10440

ZD: 8198

==== CL 10441 ====
@FIX: killing an already finished (complete, failed, killed) job leaves the job in the "dying" state.

==== CL 10440 ====
@FIX: another fix for out-of-order dispatch behavior: eliminated a race condition that allowed lower-priority jobs that had just been preempted to get workers before higher-priority jobs.

ZD: 8198

==== CL 10429 ====
@FIX: out-of-order job dispatching issue with jobs that use the "+" sign in "host.processors" reservations.

ZD: 8198 8261 8229 8233 8228
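
For readers unfamiliar with the "+" form, a minimal sketch follows. It assumes that a reservation such as host.processors=2+ means "at least 2 processors, plus any others that are free on the worker", and that a plain host.processors=2 means exactly 2. The function names and the slot arithmetic are illustrative only and are not Qube's implementation.

```python
import re
from typing import NamedTuple, Optional


class ProcReservation(NamedTuple):
    minimum: int  # processors the subjob must get
    greedy: bool  # True when the value ends in "+"


def parse_host_processors(reservations: str) -> Optional[ProcReservation]:
    """Pull host.processors out of a comma-separated reservation string."""
    for item in reservations.split(","):
        match = re.fullmatch(r"\s*host\.processors\s*=\s*(\d+)(\+?)\s*", item)
        if match:
            return ProcReservation(int(match.group(1)), match.group(2) == "+")
    return None


def processors_to_reserve(res: ProcReservation, free: int) -> int:
    """Return how many processors to reserve, or 0 if the subjob does not fit."""
    if free < res.minimum:
        return 0
    # Assumed meaning of "+": take every free processor once the minimum is met.
    return free if res.greedy else res.minimum
```

Under these assumptions, a "+" job can only start on a worker whose free processor count meets the minimum, which is why its place in the dispatch order depends on how the supervisor counts free processors.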

==== CL 10189 ====
@FIX: timing issue where some worker resources (host.xyz) would disappear after the worker received a remote config.

@FIX: issue where the supervisor tries to dispatch a subjob to a worker with
insufficient resources (reduced the likelihood of that happening)

@FIX: the above 2 fixes combined should now prevent some of the
out-of-priority-order dispatch issues, especially in environments where
worker resources are deployed.

ZD: 7885

==== CL 10118 ====
@FIX: fixed issue where agenda timeouts don't work properly on the first agenda item processed by a subjob, on Unix (Linux/OSX) workers

==== CL 10117 ====
@FIX: fixed issue where agenda items that fail because of timeout don't get automatically retried via retrywork
ZD: 7763
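
As a rough sketch of the intended behavior, the loop below treats a timeout like any other failure when deciding whether to retry an agenda item. It assumes retrywork is the number of extra attempts allowed per item; `process_item` and `run_once` are hypothetical names, and this is not the worker's actual code.

```python
def process_item(run_once, retrywork):
    """Run one agenda item, retrying up to `retrywork` times after a failure or timeout."""
    attempts = 0
    while True:
        attempts += 1
        try:
            run_once()
            return "complete", attempts
        except (TimeoutError, RuntimeError):
            # A timeout consumes a retry exactly like any other failure.
            if attempts > retrywork:
                return "failed", attempts
```

The function returns the final status together with the number of attempts made, so an item that times out once and then succeeds reports "complete" after two attempts.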

==== CL 10022 ====
@FIX: modified the worker to report its host status to the supervisor only when subjobs are completely done and removed, and NOT when they are merely marked/scheduled for removal.

The old behavior sometimes caused jobs to run out of order, especially when
a job has many subjobs (such as one subjob per frame), since that
situation tends to increase the chance of the supervisor dispatching the
same subjob to the same worker. The subjob would be dispatched to the same
worker but rejected, since the worker treats it as a duplicate assignment of
a subjob that is being removed. Consequently, a lower-priority job would
get the worker's slot, causing out-of-order job execution.

ZD: 7601
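
The change in reporting can be pictured with a small toy model. The `ToyWorker` class below is hypothetical and only mirrors the description above: a status report (here, a free-slot count) goes to the supervisor when a subjob is fully removed, and not when it is merely marked for removal.

```python
class ToyWorker:
    """Hypothetical model of when a worker reports its status to the supervisor."""

    def __init__(self, total_slots):
        self.total_slots = total_slots
        self.subjobs = {}  # subjob id -> "running" or "removing"
        self.reports = []  # free-slot counts sent to the supervisor

    def start(self, subjob_id):
        self.subjobs[subjob_id] = "running"

    def mark_for_removal(self, subjob_id):
        # Only scheduled for removal: the slot is still occupied, so no report.
        self.subjobs[subjob_id] = "removing"

    def finish_removal(self, subjob_id):
        # Completely done and removed: the slot is really free, so report now.
        del self.subjobs[subjob_id]
        self.reports.append(self.total_slots - len(self.subjobs))
```

Because no report goes out while a subjob is still in the "removing" state, the supervisor in this model has no reason to send the same subjob back to a worker that would reject it as a duplicate.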

==== CL 9903 ====
@FIX: better message from worker when it rejects a dispatched subjob because it's a duplicate (being preempted or migrated on the same worker)

==== CL 9838 ====
@CHANGE: upped the default value for supervisor_max_threads to 100, and worker_max_threads to 32
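
Sites that set these values explicitly may want to check whether they still sit below the new defaults. The sketch below assumes the settings live in a "key = value" text file such as qb.conf, with "#" starting a comment, and that an explicit setting overrides the default; `effective_thread_limits` is a hypothetical helper, not part of Qube.

```python
NEW_DEFAULTS = {"supervisor_max_threads": 100, "worker_max_threads": 32}


def effective_thread_limits(conf_text):
    """Overlay explicit settings from the config text on the 6.3.6 defaults."""
    limits = dict(NEW_DEFAULTS)
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in limits:
            limits[key] = int(value)
    return limits
```

Passing the contents of the config file returns both limits, with any explicit value taking the place of the default.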