Author Topic: Pending and then failed.  (Read 3392 times)

sersia

  • Sr. Member
  • ****
  • Posts: 26
Pending and then failed.
« on: July 09, 2010, 11:45:50 AM »
Hello guys~

Our customer has a problem.

They are using 30 worker nodes.

The workers are in a group named 'lix farm'.

If no job is running, the Qube supervisor picks up a job and runs it without any problem.

But when jobs are already running on all 30 workers, the Qube supervisor at first accepts the next job, which is supposed to use 30 CPUs, and puts it in the pending state.

I think the job should wait for the earlier jobs to finish.

But the job's state soon changes to failed.

The error message looks like this:

/////////////////////////////////////////////////////////////
WARNING: cannot open scenefile [/input/SYS/world_now/3D/scenes/cut01_earth_c.mb] for reading at /usr/local/pfx/jobtypes/maya/Utils.pm line 35.

==================== INIT BEGIN ====================
HOME: /home/qubeproxy
QBDIR:
MAYA_LOCATION:
MAYA_PLUG_IN_PATH:
MAYA_SCRIPT_PATH:

Project Directory: /input/SYS/world_now/3D
Render Directory:  /input/SYS/world_now/3D/images
Scene File:        /input/SYS/world_now/3D/scenes/cut01_earth_c.mb

$VAR1 = bless( {
  'defaultRenderGlobals.currentRenderer' => 'mayaSoftware',
  'scenefile' => '/input\\SYS\\world_now\\3D\\scenes\\cut01_earth_c.mb',
  'project' => '/input\\SYS\\world_now\\3D',
  'range' => '1-90',
  'renderDirectory' => '/input\\SYS\\world_now\\3D/images',
  'ignoreRenderTimeErrors' => '1'
}, 'maya::Package' );

ERROR: scenefile [/input/SYS/world_now/3D/scenes/cut01_earth_c.mb] does not exist on the execution machine [render08] at /usr/local/pfx/jobtypes/maya/MayaJob.pm line 170.
INFO: reporting status [failed] to supe: qb::reportjob('failed')

////////////////////////////////////////////////////////////////////////////////

But when there is no other job on Qube, that same job runs fine.

If Qube is using only some of the CPUs rather than all of them, the next job runs fine.

The problem happens only when Qube is using all of the CPUs and the next job wants to use all of them too.

Do you have any idea about this problem?

Something is wrong, but I cannot figure it out on my own.

I look forward to hearing from you.

Thank you for your concern.

Have a nice weekend.




jburk

  • Administrator
  • *****
  • Posts: 493
Re: Pending and then failed.
« Reply #1 on: July 09, 2010, 04:18:13 PM »
So the scene file is found when the farm is empty, but appears to be missing when the farm is busy?

I'd start by looking into the file server; can it only handle so many connections at the same time?  Are the workers dropping mounts when they're busy?

What OS do the workers run, how do they mount the shared filesystem, and what is the file server?  Is it an appliance or a host running an OS?  If a host, what OS does it run?
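
One quick way to test the dropped-mount theory is to probe the scene file repeatedly from a worker while the farm is busy. The following is only a minimal sketch: the helper name `probe_file` and its parameters are made up for illustration and are not part of Qube.

```python
import time


def probe_file(path, attempts=5, delay=1.0):
    """Open `path` for reading `attempts` times, `delay` seconds apart.

    Returns a list of (attempt_number, readable, detail) tuples.
    Hypothetical helper for illustration; not part of Qube.
    """
    results = []
    for attempt in range(1, attempts + 1):
        try:
            # Read one byte so the open actually touches the file server.
            with open(path, "rb") as handle:
                handle.read(1)
            results.append((attempt, True, "ok"))
        except OSError as exc:
            # e.g. "No such file or directory" or a stale-handle error
            results.append((attempt, False, exc.strerror or str(exc)))
        if attempt < attempts:
            time.sleep(delay)
    return results
```

Run it on a worker such as render08 with the scene path from the error log, once while the farm is idle and once while it is fully loaded. If the results differ, that would point at the mount or the file server rather than at the supervisor.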

sersia
Re: Pending and then failed.
« Reply #2 on: July 10, 2010, 02:46:12 AM »
It worked before.

Yes, the scene file is found when the farm is empty, but appears to be missing when the farm is busy.

We can see the pending job in the Qube GUI.

But when the farm is busy (using all of the CPUs), that job soon fails.

It is supposed to stay pending.

They've changed these options:

supervisor_max_threads=120
supervisor_max_clients=256
supervisor_idle_threads=50

and unchecked 'query SQL' in the preferences.

It was working very well before, but suddenly this problem appeared.

The host and the workers run Red Hat Linux (64-bit).

The clients run Windows XP and Windows 7 (64-bit).

The host also runs the file server, using NFS.

The Qube core on the host, workers, and clients is 5.5.2.

Only the GUI and the Maya jobtype are 5.5.5.

More details:

1 Host on Linux (64-bit) ---- 30 Workers on Linux (64-bit) -> 1 group as lix farm
with fileserver          ---- 30 Workers on WinXP (32-bit) -> 1 group as win farm
                         ---- 15 Workers on Linux (64-bit) -> 1 group as blade farm (for V-Ray)

They have 60 Qube licenses.

They don't use the 30 WinXP workers because those machines are old and very slow.

Right now the blade farm's 15 workers are running other jobs, for V-Ray for Maya.

Jobs on the blade farm stay pending correctly.

It only happens with the lix farm's 30 CPUs.

!! It worked before. The problem appeared some days ago. !!

Thank you jburk~
« Last Edit: July 10, 2010, 03:23:29 AM by sersia »

jburk
Re: Pending and then failed.
« Reply #3 on: July 15, 2010, 05:54:48 PM »
What is the file server that the scene is on? 

sersia
Re: Pending and then failed.
« Reply #4 on: August 05, 2010, 01:07:14 PM »
Sorry for the late reply..

That file server is a set of RAID hard drives holding the rendered and scene data.

That problem is gone now.

I don't know why...

Anyway, thank you..