Author Topic: Stuck jobs  (Read 4869 times)

sosborne76

  • Sr. Member
  • ****
  • Posts: 41
Stuck jobs
« on: March 02, 2010, 09:43:54 AM »
A number of jobs submitted to our Qube farm recently mostly completed (e.g. 96-100%), yet a few frames continued rendering for hours, and in a few cases a couple of days. I tried completing the individual frames through the GUI, but they refused to complete. Shoving, retiring, and preempting the jobs all had no effect. In the end only two things moved things forward. One was killing the frames or the job, which seems counterproductive given that they should have finished rendering. The other was blocking the job, selecting the few frames that then showed as blocked (i.e. the few that were previously running), and completing them. But to be honest I don't think that causes them to complete properly, as they complete with a time of 0 despite having been processing for hours.

Can anyone tell me what is going on here, and which commands I should use to resolve the issue? If it's just a case of killing jobs, that seems a bit of a failing on Qube's part: you can't recover from the situation, which matters if this happens regularly. It also gives my users a bad experience.

jburk

  • Administrator
  • *****
  • Posts: 493
Re: Stuck jobs
« Reply #1 on: March 02, 2010, 03:18:57 PM »
It's probably that the application doing the processing (Maya, Max, whatever) has crashed or driven the machine into swap.

I'd suggest attempting to migrate the stuck subjobs off the machine; this performs an implied kill and retry of the frame the stuck subjob is working on, and it prevents that worker from accepting any more subjobs from that job for a period of time.

Marking a subjob as complete only takes effect when the subjob has finished processing the current agenda item it's working on, at which time the supervisor tells the subjob that it's complete.  Shoving a job instructs the supervisor to reevaluate a job for pending subjobs and work dispatch to workers; it has no effect on running subjobs and frames.

Retiring and preemption change what happens when a subjob requests more work from the supervisor. Retiring is the same as complete (the subjob just goes into a complete state even though there is more work to process), and preemption kicks the subjob off the worker and puts it back in a pending state.
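
To keep the distinctions above straight, here is a small Python summary. It is only a sketch of the behaviour described in this thread, not something taken from the Qube documentation; the `Action` class and the `frees_hung_frame` flag are made-up names for illustration. The point is that anything which waits for the subjob to finish its agenda item or ask for more work never fires when the frame is hung.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    takes_effect: str       # when the supervisor applies it, per this thread
    frees_hung_frame: bool  # does it get a hung frame off the worker?


# Behaviour as described in the posts above (an informal summary).
ACTIONS = [
    Action("complete", "when the subjob finishes its current agenda item", False),
    Action("shove", "re-evaluates pending subjobs and dispatch only", False),
    Action("retire", "when the subjob next requests work", False),
    Action("preempt", "when the subjob next requests work", False),
    Action("migrate", "implied kill and retry of the current frame", True),
    Action("kill", "stops the frame or job outright", True),
]


def actions_for_hung_frame():
    """Return the names of actions that do not wait on the hung subjob."""
    return [a.name for a in ACTIONS if a.frees_hung_frame]


print(actions_for_hung_frame())
```

Of the two that work, migrate is the one that retries the frame, which is why it is the suggestion here.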

sosborne76

  • Sr. Member
  • ****
  • Posts: 41
Re: Stuck jobs
« Reply #2 on: March 02, 2010, 03:59:50 PM »
Thanks, it's handy to get a real-world idea of what the command-line commands are useful for.

I will play around with migrate in the GUI. Should anything be done with the worker that was stuck, or is it OK after the migrate?

jburk

  • Administrator
  • *****
  • Posts: 493
Re: Stuck jobs
« Reply #3 on: March 02, 2010, 06:47:38 PM »
It should be OK after the migrate, although you may want to check the pagefile activity if it's a Windows host and the job was expected to consume a lot of memory; once a Windows host goes into swap it's generally time for a reboot...
« Last Edit: March 02, 2010, 06:51:23 PM by jburk »

sosborne76

  • Sr. Member
  • ****
  • Posts: 41
Re: Stuck jobs
« Reply #4 on: March 03, 2010, 01:48:28 PM »
The migrate command seems to be moving things along.

At the moment a number of submitted jobs are having this issue. It's a bit of an administrative burden to keep on top of it, especially rebooting the Windows servers, given that we have no full-time render wrangler.

jburk

  • Administrator
  • *****
  • Posts: 493
Re: Stuck jobs
« Reply #5 on: March 10, 2010, 06:55:19 PM »
You can work on decreasing the memory footprint of your renders by either breaking them out into (more) layers, or trying to optimize the scene some more.

 Are you running multiple renders on the same host?  You could always set the job's reservations to "host.processors=1+" to take over the entire box.

Or you can add more RAM to your workers if the scene is already pretty clean.
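
If you submit through the Python API rather than the GUI, the whole-host reservation mentioned above could be set in the job dictionary. This is a minimal sketch, assuming the dict-style job used by Qube's `qb` module; the helper name, job name, and command line are hypothetical, and the submit call is left commented out so the snippet runs without Qube installed.

```python
def whole_host_job(name, cmdline):
    """Build a Qube-style job dict that asks for the entire worker."""
    return {
        "name": name,
        "prototype": "cmdline",
        "package": {"cmdline": cmdline},
        # Per the post above, "1+" lets the job take over the entire box.
        "reservations": "host.processors=1+",
    }


job = whole_host_job("shot010_beauty", "render_scene.bat")  # hypothetical names
print(job["reservations"])

# On a machine with Qube's Python API available (assumption):
# import qb
# qb.submit([job])
```

With one render per host, each render gets all of the worker's RAM, at the cost of fewer frames in flight across the farm.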

Tomio

  • Newbie
  • *
  • Posts: 1
Re: Stuck jobs
« Reply #6 on: April 15, 2010, 11:32:54 AM »
Hello,
We're having the same issue here, and it's starting to be a BIG problem.
Our rendering guy has to stay up all night, manually killing the 99% completed renders to free up the workers for other jobs.

We're on Softimage 7.01 / Windows XP 64-bit.
Is this a Qube problem, or is it something to do with Softimage?

I logged in to one of those render workers and found an XSIBatch runtime error.
Please see the attached image.

I have a 17-blade render farm with the newest Nehalem processors; 11 of the workers have 16GB of RAM and 6 have 12GB.

Am I running out of memory?

Should I set the page file size in Windows XP 64-bit equal to the physical memory, or should I set it to "System Managed Size"?

[attachment deleted by admin]

michael.graf

  • Sr. Member
  • ****
  • Posts: 26
Re: Stuck jobs
« Reply #7 on: April 17, 2010, 02:12:10 PM »
I would recommend using a monitoring tool like Cacti or Ganglia if you are wondering how your systems are performing and want to look at historical utilization.

www.cacti.net
www.ganglia.info

I have used the data from Cacti to help justify cluster expansions and the purchase of additional licenses, based on utilization and projected growth.

As for your errors, it all depends on what that runtime error code means.
If your system is paging, either increase the RAM or reduce the job size or the number of jobs on the machine. Paging just kills calculation efficiency and should be avoided. Using one of the monitoring tools I mentioned will help you track this usage.
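
As a rough way to reason about "the number of jobs on the machine", here is a back-of-envelope Python sketch. The 16GB and 12GB figures come from the farm described above; the 6GB per-render footprint and the 2GB held back for the OS are made-up numbers, so substitute whatever your monitoring actually shows.

```python
def max_concurrent_renders(host_ram_gb, per_render_gb, os_reserve_gb=2.0):
    """Estimate how many renders fit in physical RAM without paging."""
    if per_render_gb <= 0:
        raise ValueError("per_render_gb must be positive")
    usable_gb = host_ram_gb - os_reserve_gb
    # Floor division: a render that only partly fits would push the host into swap.
    return max(0, int(usable_gb // per_render_gb))


for ram_gb in (16, 12):
    fit = max_concurrent_renders(ram_gb, per_render_gb=6.0)
    print(f"{ram_gb}GB host: {fit} concurrent render(s)")
```

If the answer for a host is lower than the number of subjobs it is currently running, that host is a likely candidate for paging.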

Also, ensure that any programs you run are executing in a proper batch mode. Ideally, a program run in batch mode will not pop up a dialog box for user input. If it does, open a bug ticket with the vendor to have it corrected! I have personally run into this type of batch mode problem before on the Windows version of a program, whereas the Linux release was correct. It was a "license about to expire" warning, of all things!  ::)
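
One generic guard against a batch program that hangs waiting on a dialog is to wrap the command in a time limit. The sketch below uses only Python's standard `subprocess` module; it is not a Qube feature, the function name is made up, and the "hung render" is simulated with a sleeping child process. Note that it only kills the direct child, so a renderer that spawns its own helper processes would need extra handling.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed on timeout."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        return None


# Stand-in for a hung render: a child process that sleeps past the limit.
hung = [sys.executable, "-c", "import time; time.sleep(30)"]
result = run_with_timeout(hung, timeout_s=1)
print("timed out" if result is None else f"exit code {result}")
```

A frame that fails this way can then be retried or flagged, rather than holding a worker for days.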