Author Topic: Auto Locking Failed Workers  (Read 5835 times)

awells

  • Jr. Member
  • **
  • Posts: 8
Auto Locking Failed Workers
« on: November 26, 2008, 01:12:23 AM »
Hello,

I'm having an issue where, when our farm is full, a single worker can cause all of the jobs on our farm to fail. Basically the worker fails and becomes the only idle, unlocked worker. Because all of the other workers are busy or locked, it gets picked up by the next pending job and proceeds to fail that one as well. This continues until every pending job on the farm has failed. When the farm is being monitored, the fix is as simple as rebooting the problematic machine and restarting the failed jobs. Is there a way to automatically lock a worker when it fails a certain number of subjobs within a certain period of time? Ideally I'd like the machine to automatically lock, reboot, unlock, and then retry the failed subjobs, but I'd be satisfied with automatic locking for now. Any help is greatly appreciated. Thanks
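
To make it concrete, here is roughly the watchdog behavior I'm imagining, as a minimal Python sketch. get_recent_failure_times() is just a placeholder for however you'd query failures from your logs, and it assumes qblock is the command that locks a worker host (substitute whatever lock command your setup uses):

    # Minimal sketch of the auto-lock watchdog described above.
    # get_recent_failure_times() is a placeholder you'd implement
    # against your own job logs; "qblock" is assumed to be the
    # command that locks a worker host.
    import subprocess
    import time

    FAIL_LIMIT = 5      # lock after this many subjob failures...
    WINDOW_SECS = 600   # ...within this many seconds

    def get_recent_failure_times(host):
        # Placeholder: return epoch timestamps of this host's recent
        # subjob failures, e.g. parsed from supervisor logs.
        return []

    def check_and_lock(host):
        cutoff = time.time() - WINDOW_SECS
        recent = [t for t in get_recent_failure_times(host) if t >= cutoff]
        if len(recent) >= FAIL_LIMIT:
            # Lock the host so pending jobs stop landing on it.
            subprocess.call(["qblock", host])

    if __name__ == "__main__":
        for host in ["worker01", "worker02"]:   # made-up host names
            check_and_lock(host)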

Scot Brew

  • Hero Member
  • *****
  • Posts: 272
    • PipelineFX
Re: Auto Locking Failed Workers
« Reply #1 on: November 26, 2008, 08:12:10 PM »
We have received the request for an "auto wrangler" option to lock Workers if they fail consecutive jobs and also block jobs that continually fail on Workers.  To do so without causing the pathological cases where all the Workers are locked or all the jobs are blocked will take a bit of work.

We do have such an "auto wrangler" feature in the development plan, though it is not slated for immediate upcoming releases.
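
To illustrate the kind of guard that makes it non-trivial, here is a rough sketch (not shipping code, and the names are illustrative only, not actual Qube API): before auto-locking a failing Worker, the wrangler has to verify that the unlocked pool won't drop below a safety floor, and an analogous floor applies before auto-blocking a job.

    # Rough sketch of the safety check an "auto wrangler" needs so it
    # can never lock the entire farm. Names are illustrative only.
    MIN_UNLOCKED_FRACTION = 0.5   # keep at least half the farm unlocked

    def may_auto_lock(total_workers, unlocked_workers):
        # Refuse the lock if it would leave too few unlocked Workers.
        remaining = unlocked_workers - 1
        return remaining >= total_workers * MIN_UNLOCKED_FRACTION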

michael.graf

  • Sr. Member
  • ****
  • Posts: 26
Re: Auto Locking Failed Workers
« Reply #2 on: November 27, 2008, 01:25:31 PM »
Looking at it another way: has the cause of the worker failure been determined? Is it OS or software related? By "worker", do you mean the PipelineFX worker daemon, or the host node itself? If this is happening repeatedly, finding the root cause would be my suggestion.

What error does the job report when it fails?

Until an "auto wrangler" is available, depending on the problem, a callback might be able to interrogate the worker to check "something" before it takes more work.
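
For example (just a sketch of the idea; the wiring into an actual callback is omitted, and the thresholds and scratch path are made-up examples), the callback could run a quick sanity check on the host:

    # Sketch of a health check a callback could run on the worker.
    # Thresholds and scratch path are made-up examples; the hook-up
    # to the actual callback mechanism is omitted.
    import os

    MIN_FREE_BYTES = 5 * 1024 ** 3   # e.g. require 5 GB free scratch
    MAX_LOAD = 32.0                  # e.g. flag a runaway load average

    def host_looks_healthy(scratch_path="/tmp"):
        st = os.statvfs(scratch_path)            # Unix-only
        if st.f_bavail * st.f_frsize < MIN_FREE_BYTES:
            return False
        if os.getloadavg()[0] > MAX_LOAD:
            return False
        return True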

We struggled for a few months with randomly failing jobs. Hardware diagnostics and extensive debugging scripts that monitored the jobs while they ran all turned up nothing; there was no apparent reason the jobs should have been failing. We contacted the software vendor, and after weeks of convincing them it was not our fault (hardware, OS config, runtime environment, etc.) they finally released a special build of their software to us. Not a single unexplained job failure since.


awells

  • Jr. Member
  • **
  • Posts: 8
Re: Auto Locking Failed Workers
« Reply #3 on: December 24, 2008, 12:13:40 AM »
Thank you both for your replies. I was referring to the actual host node, as opposed to the Qube worker daemon. We weren't able to determine the cause of the failures, but upgrading the Qube core and worker daemons from 5.3 to 5.4 on our machines seems to have solved the problem.

sz

  • Jr. Member
  • **
  • Posts: 7
Re: Auto Locking Failed Workers
« Reply #4 on: January 05, 2010, 09:06:12 PM »
Hi,

I have been having exactly the same problem for a while now:

"a single worker will cause all of the jobs on our farm to fail. Basically the worker will fail and become the only idle, unlocked worker. And because all of the other workers are busy or locked, it will be picked up by the next pending job, and proceed to fail it as well. This continues until all pending jobs on the farm have failed.  When the farm is being monitored, the fix is as simple as rebooting the problematic machine and restarting the failed jobs."

I have updated all of the systems to the latest Qube and I am still having this problem, but only on the OS X 10.5 systems. The Fedora boxes never have it, so it looks like an OS X-specific issue.

This glitch on the Macs is making the entire render farm unusable, as the majority of our systems are Macs.

Any suggestions, please?

Thanks
sz