Topic: unable to find a job to start on host (logs included)

neight

  • Jr. Member
  • **
  • Posts: 9
    • URI Pharmacy Animations
unable to find a job to start on host (logs included)
« on: July 10, 2007, 06:28:29 PM »
So the issue is, I have the client using the Qube! GUI to submit a job... it seems to be working OK, but then it fails out... All the computers are Macs, and the workers use the Qube! proxy user.

The scene file is located on the server, in a shared folder. The worker and the client are both logged in and have read and write access to this shared folder.

Below are the log files from the super and the worker... it seems to boil down to 'unable to find a job to start on host'. From the command line, both the worker and the client can reach the /Volume path to the scene files; however, that path syntax doesn't work on the super (I don't know if it even has to).

Anyway, here are the logs. Any feedback would be appreciated.

-n8

----------------------
log file from super

===== Tuesday, July 10, 2007 1:42:24 PM US/Eastern =====
[Jul 10, 2007 13:43:33] 105.17.128.131.dhcp.uri.edu : job query received: 131.128.17.219
[Jul 10, 2007 13:43:35] 105.17.128.131.dhcp.uri.edu : host query received: 131.128.17.219
[Jul 10, 2007 13:43:35] 105.17.128.131.dhcp.uri.edu : host query received: 131.128.17.219
[Jul 10, 2007 13:43:39] 105.17.128.131.dhcp.uri.edu : job query received: 131.128.17.219
[Jul 10, 2007 13:43:41] 105.17.128.131.dhcp.uri.edu : wrote stderr data to file: /var/spool/qube/job/0/265/265_0.err size: 89
[Jul 10, 2007 13:43:41] 105.17.128.131.dhcp.uri.edu : wrote stdout data to file: /var/spool/qube/job/0/265/265_0.out size: 89
[Jul 10, 2007 13:43:41] 105.17.128.131.dhcp.uri.edu : job query received: 131.128.17.219
[Jul 10, 2007 13:43:41] 105.17.128.131.dhcp.uri.edu : retrying in supervisor by administrator from 131.128.17.219: 265.0
[Jul 10, 2007 13:43:41] 105.17.128.131.dhcp.uri.edu : assigning job 265 to 223.17.128.131.dhcp.uri.edu (131.128.17.223)
[Jul 10, 2007 13:43:41] 105.17.128.131.dhcp.uri.edu : job query received: 131.128.17.219
[Jul 10, 2007 13:43:42] 105.17.128.131.dhcp.uri.edu : worker 223.17.128.131.dhcp.uri.edu (131.128.17.223) has accepted job: 265.0
[Jul 10, 2007 13:43:42] 105.17.128.131.dhcp.uri.edu : host accept status received: 131.128.17.223 host.processors=1/2,host.memory=546/1024
[Jul 10, 2007 13:43:43] 105.17.128.131.dhcp.uri.edu : worker reports: 223.17.128.131.dhcp.uri.edu (131.128.17.223) subjob: 265.0 - running seq: 104
[Jul 10, 2007 13:43:45] 105.17.128.131.dhcp.uri.edu : worker reports: 223.17.128.131.dhcp.uri.edu (131.128.17.223) subjob: 265.0 - complete seq: 109
[Jul 10, 2007 13:43:45] 105.17.128.131.dhcp.uri.edu : remove order sent to worker: 131.128.17.223 subjob: 265.0
[Jul 10, 2007 13:43:45] 105.17.128.131.dhcp.uri.edu : received stdout data in report for job: 265.0 - complete
[Jul 10, 2007 13:43:45] 105.17.128.131.dhcp.uri.edu : wrote stdout data to file: /var/spool/qube/job/0/265/265_0.out size: 454
[Jul 10, 2007 13:43:45] 105.17.128.131.dhcp.uri.edu : received stderr data in report for job: 265.0 - complete
[Jul 10, 2007 13:43:45] 105.17.128.131.dhcp.uri.edu : wrote stderr data to file: /var/spool/qube/job/0/265/265_0.err size: 3204
[Jul 10, 2007 13:43:45] 105.17.128.131.dhcp.uri.edu : host status received: 131.128.17.223 host.processors=0/2,host.memory=546/1024
[Jul 10, 2007 13:43:45] 105.17.128.131.dhcp.uri.edu : unable to find a job to start on host 223.17.128.131.dhcp.uri.edu (131.128.17.223)
[Jul 10, 2007 13:43:51] 105.17.128.131.dhcp.uri.edu : host status received: 131.128.17.223 host.processors=0/2,host.memory=546/1024
[Jul 10, 2007 13:43:51] 105.17.128.131.dhcp.uri.edu : unable to find a job to start on host 223.17.128.131.dhcp.uri.edu (131.128.17.223)
-------------------------------------------

log for worker

===== Tuesday, July 10, 2007 1:43:17 PM US/Eastern =====
[Jul 10, 2007 13:43:42] 223.17.128.131.dhcp.uri.edu : INFO: new job qualifies: 265.0
[Jul 10, 2007 13:43:42] 223.17.128.131.dhcp.uri.edu : INFO: unable to find logdir, building job's log directory.
[Jul 10, 2007 13:43:42] 223.17.128.131.dhcp.uri.edu : received start order for new job: 265.0
[Jul 10, 2007 13:43:42] 223.17.128.131.dhcp.uri.edu : INFO: using qubeproxy in place of administrator.
[Jul 10, 2007 13:43:42] 223.17.128.131.dhcp.uri.edu : job 265.0 process id: 316
[Jul 10, 2007 13:43:43] 223.17.128.131.dhcp.uri.edu : running job status report sent to supervisor: 265.0
[Jul 10, 2007 13:43:43] 223.17.128.131.dhcp.uri.edu : received request for job details: 265.0
[Jul 10, 2007 13:43:43] 223.17.128.131.dhcp.uri.edu : received status report from proxy: 265.0 - running seq: 99
[Jul 10, 2007 13:43:43] 223.17.128.131.dhcp.uri.edu : gathering stats on job: 265.0
[Jul 10, 2007 13:43:43] 223.17.128.131.dhcp.uri.edu : sending report to supervisor for job: 265.0 - running seq: 104
[Jul 10, 2007 13:43:44] 223.17.128.131.dhcp.uri.edu : supervisor 131.128.17.105 confirmed report 265.0
[Jul 10, 2007 13:43:44] 223.17.128.131.dhcp.uri.edu : sent logs 265.0 0 - bytes.
[Jul 10, 2007 13:43:44] 223.17.128.131.dhcp.uri.edu : received status report from proxy: 265.0 - complete seq: 99
[Jul 10, 2007 13:43:44] 223.17.128.131.dhcp.uri.edu : returning work for job: 265 total items: 0
[Jul 10, 2007 13:43:44] 223.17.128.131.dhcp.uri.edu : gathering stats on job: 265.0
[Jul 10, 2007 13:43:45] 223.17.128.131.dhcp.uri.edu : reading /var/spool/qube/job/0/265/265_0.out to transmit back.
[Jul 10, 2007 13:43:45] 223.17.128.131.dhcp.uri.edu : reading /var/spool/qube/job/0/265/265_0.err to transmit back.
[Jul 10, 2007 13:43:45] 223.17.128.131.dhcp.uri.edu : sending report to supervisor for job: 265.0 - complete seq: 109
[Jul 10, 2007 13:43:45] 223.17.128.131.dhcp.uri.edu : sent logs 265.0 3658 - bytes.
[Jul 10, 2007 13:43:45] 223.17.128.131.dhcp.uri.edu : scheduling job for removal: 265.0
[Jul 10, 2007 13:43:45] 223.17.128.131.dhcp.uri.edu : sending host status report to the supervisor.
[Jul 10, 2007 13:43:46] 223.17.128.131.dhcp.uri.edu : supervisor 105.17.128.131.dhcp.uri.edu host report - report successful.
[Jul 10, 2007 13:43:50] 223.17.128.131.dhcp.uri.edu : process: 265.0 - 316 finally dead: complete
[Jul 10, 2007 13:43:50] 223.17.128.131.dhcp.uri.edu : releasing resources for: 265.0 res: 'host.processors=1'
[Jul 10, 2007 13:43:50] 223.17.128.131.dhcp.uri.edu : running unix cleanup.
[Jul 10, 2007 13:43:50] 223.17.128.131.dhcp.uri.edu : terminated process: 265.0 - 316
[Jul 10, 2007 13:43:50] 223.17.128.131.dhcp.uri.edu : removed log directory /var/spool/qube/job/0/265
[Jul 10, 2007 13:43:50] 223.17.128.131.dhcp.uri.edu : removed job 265.0
[Jul 10, 2007 13:43:51] 223.17.128.131.dhcp.uri.edu : sending host status report to the supervisor.
[Jul 10, 2007 13:43:51] 223.17.128.131.dhcp.uri.edu : supervisor 105.17.128.131.dhcp.uri.edu host report - report successful.
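
Both logs above use one consistent line shape: `[timestamp] host : message`. As a minimal sketch (not a Qube! tool), the Python below parses that shape and measures how long a subjob ran between its start order and its first "complete" report. The names `parse_line` and `job_runtime` are made up for illustration, and the month parsing assumes an English locale.

```python
import re
from datetime import datetime

# Matches lines shaped like "[Jul 10, 2007 13:43:42] host : message".
LINE_RE = re.compile(
    r"^\[(?P<ts>[A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2}:\d{2})\] "
    r"(?P<host>\S+) : (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # %b relies on English month abbreviations (an assumption about the locale).
    ts = datetime.strptime(m.group("ts"), "%b %d, %Y %H:%M:%S")
    return ts, m.group("host"), m.group("msg")


def job_runtime(lines, job_id):
    """Seconds from the start order to the first 'complete' message for job_id."""
    start = end = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _host, msg = parsed
        if job_id not in msg:
            continue
        if start is None and "received start order" in msg:
            start = ts
        elif start is not None and end is None and "complete" in msg:
            end = ts
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Two lines copied from the worker log above.
SAMPLE = [
    "[Jul 10, 2007 13:43:42] 223.17.128.131.dhcp.uri.edu : received start order for new job: 265.0",
    "[Jul 10, 2007 13:43:44] 223.17.128.131.dhcp.uri.edu : received status report from proxy: 265.0 - complete seq: 99",
]

print(job_runtime(SAMPLE, "265.0"))  # 2.0
```

A runtime of about two seconds hints that the subjob exited almost as soon as it started, so the job's own stdout and stderr are probably a better place to look than the supervisor log.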



anthony

  • Senior Software Engineer
  • Hero Member
  • *****
  • Posts: 183
Re: unable to find a job to start on host (logs included)
« Reply #1 on: July 10, 2007, 09:52:08 PM »
Hey Neight,

    In answer to the embedded comment, the supervisor does not need access to your scene data. It only manages the jobs. As for why your renders are bombing out, I'll need just a little more information. What kind of jobs are you submitting? Also, are you using AFP or NFS for file sharing between these hosts?

    Thanks,
          Anthony

neight

Re: unable to find a job to start on host (logs included)
« Reply #2 on: July 11, 2007, 03:15:08 PM »
Hi anthony,

we're trying to submit Maya jobs...

As for file sharing, I'm not sure what type it is... I don't know if this helps, but I set up the shared directory using OS X's built-in folder sharing options. I have the other computers auto-connect to the server and access the shared directory. I don't know if that is AFP or NFS, and if that doesn't help I could research it further...

-n8

anthony

Re: unable to find a job to start on host (logs included)
« Reply #3 on: July 18, 2007, 11:16:17 PM »
Hey Neight,

   If you are on OS X and are using the GUI to mount your drives, more than likely you are using AFP (Apple Filing Protocol). The problem with AFP is that it restricts access to the user who mounted the drives in the first place. That is typically a problem because the Qube! workers normally run under a proxy user account, which doesn't have access to those volumes. You must also mount the volumes exactly the same way on every single host. The most reliable method for connecting drives on OS X is NFS. However, that doesn't come without its share of administrative pain. Before you begin, you really must think about where you plan to store your work data. Are you planning a farm expansion in the future?
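
One way to see which volumes are tied to a single login is to look at the output of the `mount` command. The sketch below parses lines of that output and flags AFP mounts and mounts that carry a `mounted by <user>` option. The line format in the comment is an assumption about what OS X prints, the sample lines are invented (hypothetical user `nate`, hypothetical server `server.example`), and `parse_mount_line` and `per_user_mounts` are illustrative names, not part of Qube!.

```python
import re

# Assumed shape of one line of `mount` output on OS X:
#   <source> on <mount point> (<fstype>, <option>, ..., mounted by <user>)
MOUNT_RE = re.compile(r"^(?P<source>.+?) on (?P<point>.+) \((?P<opts>[^()]*)\)$")


def parse_mount_line(line):
    """Split one mount line into source, mount point, fstype and mounting user."""
    m = MOUNT_RE.match(line.strip())
    if m is None:
        return None
    opts = [o.strip() for o in m.group("opts").split(",")]
    owner = None
    for opt in opts[1:]:
        if opt.startswith("mounted by "):
            owner = opt[len("mounted by "):]
    return {
        "source": m.group("source"),
        "point": m.group("point"),
        "fstype": opts[0],
        "mounted_by": owner,
    }


def per_user_mounts(mount_output):
    """Return mount points that are AFP or that record a mounting user."""
    flagged = []
    for line in mount_output.splitlines():
        info = parse_mount_line(line)
        if info is None:
            continue
        if info["fstype"] == "afpfs" or info["mounted_by"] is not None:
            flagged.append(info["point"])
    return flagged


# Invented sample output, for illustration only.
SAMPLE = """\
/dev/disk0s2 on / (hfs, local, journaled)
//nate@server.example/Scenes on /Volumes/Scenes (afpfs, nodev, nosuid, mounted by nate)
server.example:/export/scenes on /Volumes/farm (nfs)
"""

print(per_user_mounts(SAMPLE))  # ['/Volumes/Scenes']
```

Any mount point this flags is a candidate for a volume the proxy account may not be able to see; it is a hint to check, not proof.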

          Thanks,
             Anthony

neight

Re: unable to find a job to start on host (logs included)
« Reply #4 on: July 23, 2007, 05:51:29 PM »
I spoke with somebody when I called tech support a while back... I think it was Eric, and he mentioned something about having the volumes automount to get around this access issue... Do you know how to go about doing this? Perhaps that is the NFS approach you mentioned...

Is it this, perhaps?
http://www.atmos.washington.edu/~salathe/osx_unix/nfsmount.html


I want to have a NAS box hold all the output the workers generate in the end. I also want to be able to add more workers to the farm after the initial test run of Qube! Right now, we plan to have 5 workers (3 of which are clients when in use), a supervisor, and a NAS box.

I've dealt with quite a bit of pain to get to this point with Qube!, and some more is expected. I just want to get this farm up and running correctly as soon as possible. So if NFS is the way to go, that's fine with me.

Please let me know what to do next... thanks!



« Last Edit: July 23, 2007, 05:57:07 PM by neight »

neight

Re: unable to find a job to start on host (logs included)
« Reply #5 on: July 25, 2007, 08:16:02 PM »
OK, so I set up NFS on the machines... The super has an export set up that allows any client to mount it, and it gives read/write access to all the machines connected to the mount. This is all set up through NetInfo, of course.

So I double-checked read/write access on the machines involved in my setup, and they are all good to go as far as that goes.

I then copied my Maya scene files to the shared mount. Then I used the Qube! GUI to try to process a Maya job, pointed it at the file on the shared mount, and told it to go... The job failed in a matter of seconds.

After digging through the logs, it seems the same problem is present: 'unable to find a job to start on host'

..... any ideas?

I have the logs saved in email, and I can forward those along if that helps.

-nate

PS: here are the links I found helpful in setting up NFS, in case anyone else needs to do the same:

http://mactechnotes.blogspot.com/2005/09/mac-os-x-as-nfs-server.html
http://www.macobserver.com/tips/hotcocoa/2001/20010723.shtml
http://www.atmos.washington.edu/~salathe/osx_unix/nfsmount.html


eric

  • Hero Member
  • *****
  • Posts: 229
Re: unable to find a job to start on host (logs included)
« Reply #6 on: July 25, 2007, 10:05:30 PM »
Without looking at your actual job logs, it would be difficult for us to diagnose the problem. Can you take a look at the STDERR and STDOUT tabs in the QubeGUI for the output logs? You can either excerpt them and post them here, or send them to the PFX support email.

Most likely, your problem lies with access to the scenefile. Either:
  • The scenefile is inaccessible because the mounts aren't working correctly.
  • The scenefile is inaccessible because the qubeproxy user doesn't have permission to read it.
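
The two cases in the list above can usually be told apart from a small script run as the proxy user. This is a rough sketch, assuming POSIX paths and that it runs under the account the worker uses (`qubeproxy` in the worker log); the function name `diagnose` and the example path are hypothetical.

```python
import os


def diagnose(path):
    """Walk from the root down to path and report the first problem found.

    Returns ("missing", component), ("denied", component) or ("ok", path).
    """
    path = os.path.abspath(path)

    # Collect the path and each of its parents, up to the filesystem root.
    parts = []
    head = path
    while True:
        parts.append(head)
        parent = os.path.dirname(head)
        if parent == head:
            break
        head = parent

    for component in reversed(parts):  # root first, full path last
        if not os.path.exists(component):
            # Nothing there at all: often a volume that is not mounted.
            return ("missing", component)
        # Parent directories only need to be traversable; the target itself
        # must be readable. os.access checks against the real user ID.
        needed = os.R_OK if component == path else os.X_OK
        if not os.access(component, needed):
            return ("denied", component)
    return ("ok", path)


if __name__ == "__main__":
    # Hypothetical scene path; substitute the path given to the job.
    print(diagnose("/Volumes/farm/scenes/shot01.mb"))
```

A `missing` result on the volume's own directory points at the mount, while `denied` points at permissions for whichever account ran the script.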

neight

Re: unable to find a job to start on host (logs included)
« Reply #7 on: July 27, 2007, 06:42:53 PM »
Eric,

Thanks for the help. It turns out that it was a problem with the Qube! proxy user not having access.

I had the mount point on the desktop of the admin's account on the client/worker machines... yeah, fixed that.

Qube! is running with proxy users set up here... I'll set up LDAP logins and such later. The next thing I want to do is hook up 3 Mac Pro Xeon 64-bit workstations to the farm. I remember reading something about 64-bit support in a new version, so I'm about to look into that...

Thanks again, Anthony and Eric, for your help on this. Consider this issue closed.