PipelineFX Forum
Qube! => General => Topic started by: BarsTone on July 07, 2010, 11:38:23 PM
-
Hey all,
I've been dealing with subjobs that occasionally get "stuck" and render endlessly until our Render Guy angrily kills and retries them, then restarts the workers that were executing them. This costs us valuable rendering time. So I'm hoping either to find a way to make it stop, or to get some help writing a Python script that monitors the jobs and automatically retries the chunks and restarts the workers when it detects that a chunk has been rendering beyond a certain time limit. Here's our setup:
- 3dsmax 2010, Qube 5.5.0
- 8 BOXX 16-core render nodes and 3 quad-core desktops
- All computers running XP 64-bit except one of the desktops is 32-bit
- Worker1 is the Qube supervisor
Everything will be going along fine, with each chunk taking maybe 20 minutes, so we'll let it run while we do other things. Then we'll come back in two hours, and one or two workers (different ones each time) will show that they've been rendering their chunks for the past hour, while the other workers have completed the rest of the chunks in that job and moved on to the next job. If we kill and retry those stuck chunks, sometimes their workers will pick them up again and complete them in 20 minutes. Other times they won't, and we have to restart the workers, which also fixes the problem. Looking at the desktops of the stuck workers, sometimes they have several instances of 3dsmax.exe running, but all are Not Responding and using 0% CPU.
Does anyone here have any experience with this issue? I saw a few posts about it, but nothing conclusive. If there's no way to prevent the subjobs from getting stuck, the other option is a simple Python script that monitors the running subjobs and, when one triggers some condition, automatically retries it and restarts the offending worker. Ideally the condition would be "x minutes longer than the average chunk completion time for that job". Any ideas?
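To make that condition concrete, here is a rough sketch of just the detection logic. It assumes the chunk records have already been fetched from Qube somehow and turned into plain dicts. The keys `name`, `status`, `start` and `end`, the status strings, and the `margin_minutes` and `min_samples` parameters are all made up for illustration and are not Qube's own names. Querying Qube, retrying the chunk and restarting the worker are not shown.

```python
from statistics import mean


def find_stuck_chunks(chunks, now, margin_minutes=15, min_samples=3):
    """Return names of running chunks that have exceeded the job's average
    completed-chunk time by more than margin_minutes.

    chunks: list of dicts with hypothetical keys 'name', 'status',
            'start' and 'end' (times in epoch seconds).
    now:    current time in epoch seconds.
    """
    # Durations of the chunks that have already finished in this job.
    durations = [c["end"] - c["start"] for c in chunks
                 if c["status"] == "complete"]

    # With too few finished chunks the average is not meaningful yet.
    if len(durations) < min_samples:
        return []

    limit = mean(durations) + margin_minutes * 60

    return [c["name"] for c in chunks
            if c["status"] == "running" and now - c["start"] > limit]


# Made-up example: three chunks finished in 20 minutes each, one has been
# running for an hour, one for 10 minutes.
NOW = 10000
SAMPLE = [
    {"name": "1-10",  "status": "complete", "start": 0,    "end": 1200},
    {"name": "11-20", "status": "complete", "start": 0,    "end": 1200},
    {"name": "21-30", "status": "complete", "start": 1200, "end": 2400},
    {"name": "31-40", "status": "running",  "start": 6400, "end": None},
    {"name": "41-50", "status": "running",  "start": 9400, "end": None},
]

if __name__ == "__main__":
    print(find_stuck_chunks(SAMPLE, NOW))
```

A script along these lines would poll each running job every few minutes and act only on the names this function returns. Requiring a minimum number of completed chunks keeps it from flagging anything at the very start of a job, when there is no average to compare against.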
Many thanks for any help,
Owen