Author Topic: status "running" for jobs that have finished (Read 27025 times)

apennington · « **on:** January 21, 2009, 11:04:28 PM »

Hi,
I have searched the forums and found some similar reports but no definitive answers for this issue. Some jobs that use multiple CPUs never complete because a subset of the CPUs hangs - the workers need to be restarted to free up these CPUs for subsequent jobs. At first I thought it was because the test jobs were so simple that they only required a few CPUs, but this is not the case: I have tested with a job that took 33 minutes across 16 CPUs. 5 of the CPUs never rendered a single frame and would not complete until I rebooted the servers, after which the GUI reported that the job had completed. Why is this happening? Also, any attempts to complete or kill the hanging job and subjobs from within the GUI always fail.

Thanks,
Andrew

shinya · « **Reply #1 on:** January 23, 2009, 03:43:16 AM »

Hi apennington,

Your extra subjobs (or "CPUs") shouldn't just "hang" once the frames of your
jobs are done, let alone require you to reboot the workers.

Would you be able to send us, at support < a t > pipelinefx.com, the joblogs
from the job in question, so that we may have a more in-depth look? Instructions
are included below FYI.

Also, when you try to manipulate jobs from the GUI, are you doing it as the
same user that submitted the job, or as one of the qube administrators?

----

In order to address your problem more completely, please send us the job log directory for the job in question.

You can locate the directory by logging into your Supervisor, and looking for the job log folder in the following location (depending upon your Supervisor platform):

Windows
\Program Files\pfx\qube\logs\job

Linux, OS X
/var/spool/qube/job

In that folder, you will find numbered directories, each corresponding to the number of thousands in the job ID (for IDs below 1000, the directory is 0). Look inside the matching numbered directory for the folder that corresponds to your job ID.

Zip up the entire job ID folder and reply to this email message with the zipped file.

If the zipped job ID folder turns out to be larger than 2MB, don't send it, but let us know and we will help you address the problem in an alternative fashion.
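For anyone who would rather script the steps above, here is a rough sketch, not an official PipelineFX tool. It assumes the "thousands" folder is named with the plain integer (job 12345 under `12`, job 999 under `0`) and that the job folder is named exactly after the job ID; check this against your own Supervisor before relying on it. The function names and the `job_<id>.zip` file name are made up for illustration.

```python
import shutil
from pathlib import Path

# 2MB limit mentioned in the instructions above.
MAX_ATTACHMENT_BYTES = 2 * 1024 * 1024


def job_log_dir(log_root, job_id):
    """Build the expected job log path (assumed layout: <root>/<id // 1000>/<id>)."""
    bucket = job_id // 1000  # IDs below 1000 land in folder "0"
    return Path(log_root) / str(bucket) / str(job_id)


def zip_job_log(log_root, job_id, out_dir="."):
    """Zip the whole job ID folder; return (archive path, whether it exceeds 2MB)."""
    src = job_log_dir(log_root, job_id)
    if not src.is_dir():
        raise FileNotFoundError(f"no job log folder at {src}")
    base_name = str(Path(out_dir) / f"job_{job_id}")  # hypothetical file name
    archive = Path(
        shutil.make_archive(base_name, "zip", root_dir=src.parent, base_dir=src.name)
    )
    return archive, archive.stat().st_size > MAX_ATTACHMENT_BYTES
```

On a Linux or OS X Supervisor you would call something like `zip_job_log("/var/spool/qube/job", 12345)`. If the second returned value is true, the archive is over 2MB, so don't send it and let us know instead.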

apennington · « **Reply #2 on:** January 27, 2009, 04:16:31 PM »

Shinya,

I've e-mailed the job log as you requested. In this case, the user that is trying to manipulate the jobs from the GUI is not the same user that submitted the job. Do I need to make the user a qube administrator and if so how do I do this?

shinya · « **Reply #3 on:** January 27, 2009, 11:30:33 PM »

Thanks for the logs. I'll be looking into it shortly.

As for manipulating jobs through the GUI (and command line as well),
yes, you'd either need to be the same user that submitted the jobs,
or an administrator with the necessary permissions.

To set up a user as an administrator, you use the "qbusers" command
on the command prompt. Sorry, we don't have a GUI equivalent at this
time. For example, to make user "fred" an administrator with full
privileges to do pretty much anything:

qbusers -add -all -admin -sudo -impersonate -lock fred

qbusers lives in the "sbin" folder of your qube installation (e.g.
c:\program files\pfx\qube\sbin on windows), so you may need
to add that path to your PATH environment variable.
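If you're not sure whether qbusers is already reachable, a quick check along these lines may help. This is only a sketch: the default folder is the Windows example given above and may differ on your install, and `find_qbusers` is a made-up helper name, not part of qube.

```python
import os
import shutil

# Example location from the post above; adjust for your installation.
DEFAULT_SBIN = r"c:\program files\pfx\qube\sbin"


def find_qbusers(sbin_dir=DEFAULT_SBIN):
    """Return the full path to qbusers, or None if it cannot be found."""
    # First try the PATH exactly as it is set right now.
    found = shutil.which("qbusers")
    if found:
        return found
    # Otherwise search again with the sbin folder appended to PATH.
    search_path = os.environ.get("PATH", "") + os.pathsep + sbin_dir
    return shutil.which("qbusers", path=search_path)
```

If this returns a path only on the second attempt, the sbin folder is not on your PATH yet, and you can either add it or run qbusers using the full path.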

For more info on qbusers, please refer to the Administration.pdf doc.

shinya · « **Reply #4 on:** January 27, 2009, 11:42:15 PM »

I looked at your 3dsmax job log, and see that indeed some of the
subjobs were "stuck" even after the frames were completed.

3dsmax is known to do this, and there are a few things that
you can try to avoid it.

1. make sure the execution user (I think you're using
proxy execution mode, so the "qubeproxy" user) on the workers
has administrator rights.

2. submit "3dsmax batch render" jobs. This mode is not as
involved as the non-batch mode, but is known to work more
predictably.

3. if your workers are windows vista or windows 2008, make sure
ACL is disabled, and that the "interactive services dialog
detection service" is also disabled.

apennington · « **Reply #5 on:** January 29, 2009, 08:17:25 PM »

Shinya,

Thanks for your suggestions.

1. Yes, the qubeproxy user is a local admin on the workers.

2. We will try the 3ds max batch render submittal and let you know if it works. Just need to read up on it first.

3. We are not using vista or 2008.

apennington · « **Reply #6 on:** February 05, 2009, 05:55:27 PM »

We haven't had any luck with the batch render. I'm not sure if we're submitting them correctly. At a minimum, what fields need to be filled out on the submittal form? We're reading the Autodesk documentation on command line rendering. Does PFX have any docs we could reference?

rick rubin kgo sf · « **Reply #7 on:** March 05, 2009, 05:59:01 AM »

Read your thread here and we too are having difficulty... I'd love to see how you cleared this up. Your last message is from early February. Did you resolve the problem?

apennington · « **Reply #8 on:** March 06, 2009, 03:39:03 PM »

Quote from: rick rubin kgo sf on March 05, 2009, 05:59:01 AM

Read your thread here and we too are having difficulty... I'd love to see how you cleared this up. Your last message is from early February. Did you resolve the problem?

We are still experiencing this and working with PipelineFX's support team to try and resolve the problem of the 3ds Max 2009 jobs getting stuck in the running state and not releasing the CPUs until the workers are restarted. Good to know we're not the only ones, though.
It doesn't happen all the time, but in some recent cases the "stuck" worker shows the following error(s):
"Microsoft Visual C++ Runtime Library: Runtime Error! Program: C:... R6025 - pure virtual function call." More can be found about this error here: http://support.microsoft.com/kb/125749
Once we "end task" on the Runtime, the job resumes. Hope this helps.

rick rubin kgo sf · « **Reply #9 on:** March 25, 2009, 10:47:21 PM »

Scot,

I find that when using the 3ds Max batch process as recommended, I go and 'kill' the stuck frames still 'running' and then, after a few minutes, 'retry' them using the right-click menu. Almost every time a new processor is assigned and then completes the frame.

It's a bit more monitoring than I expected, but at least for now I don't have to submit a new job with just the missing frames. I can use what is currently in the job list.
