Topic: Worker machine goes offline - qube sticks on that subjob

jbrandibas

Worker machine goes offline - qube sticks on that subjob
« on: May 30, 2009, 01:50:15 PM »
I am having an issue where a worker fails in some way (we are not yet sure whether it is hardware or software), and the subjob running on that node at the time gets stuck and cannot be killed until the node is brought back online. Is there a way to forcibly kill a running job in this situation? I have tried both the GUI kill and the qbkill command. Both report that they are killing the job, but it never dies; one has been sitting here for 12 hours. The node responds to both ping and qbping, but I cannot access it via ARD (it is an OS X box).

shinya

Re: Worker machine goes offline - qube sticks on that subjob
« Reply #1 on: June 03, 2009, 03:52:01 AM »
The supervisor should be able to kill a subjob that's running on
an unreachable worker.  However, in your case the worker machine is
not completely unreachable (it still answers ping and qbping), and
that's probably what is preventing the system from killing the job.

If you're able to log into the machine via ssh, you could try to reboot
the machine, or at least kill the worker process (by running
"/Library/StartupItems/worker/worker stop" as root).
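
For anyone who wants to script the ssh route, here is a minimal sketch.
It only prints the command so you can check it before running anything.
The worker script path is the one quoted above; the root login, the
10-second timeout, the host name node01, and the function name
stop_worker_cmd are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: build the ssh command that stops the Qube worker on a
# hung OS X node. Nothing is executed on the remote machine here.
# WORKER_SCRIPT is the path quoted in the reply above.
WORKER_SCRIPT="/Library/StartupItems/worker/worker"

stop_worker_cmd() {
    # $1 = worker hostname or IP address (hypothetical example: node01).
    # root@ and ConnectTimeout=10 are assumptions; adjust for your site.
    printf 'ssh -o ConnectTimeout=10 root@%s %s stop\n' "$1" "$WORKER_SCRIPT"
}

# Print the command for the host given as the first argument.
stop_worker_cmd "${1:-node01}"
```

Once the printed line looks right, paste it into a shell that is allowed
to log in to the worker as root.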

Otherwise, run one of the following from a command prompt on a machine
with Qube installed, as one of the Qube admins, and see if it works:

 qbadmin worker -reboot <HOSTNAME>

or

 qbadmin worker -shutdown <HOSTNAME>

where <HOSTNAME> is your worker's hostname or IP address.
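
If you end up doing this for more than one node, a small wrapper along
these lines may help. It is a sketch, not part of Qube: the function
name qb_worker_action, the DRY_RUN variable, and the host name node01
are made up for illustration, and the qbadmin flags are exactly the two
shown above. By default it only prints the command it would run.

```shell
#!/bin/sh
# Sketch only: wrap the two qbadmin commands from the reply above.
# DRY_RUN=1 (the default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

qb_worker_action() {
    action="$1"   # reboot or shutdown
    host="$2"     # worker hostname or IP address

    # Accept only the two actions mentioned in this thread.
    case "$action" in
        reboot|shutdown) ;;
        *) echo "usage: qb_worker_action reboot|shutdown HOST" >&2
           return 2 ;;
    esac
    if [ -z "$host" ]; then
        echo "qb_worker_action: missing host" >&2
        return 2
    fi

    if [ "$DRY_RUN" = "1" ]; then
        echo "qbadmin worker -$action $host"
    else
        # Must be run as one of the Qube admins.
        qbadmin worker "-$action" "$host"
    fi
}

# Hypothetical example host; replace with your worker's name.
qb_worker_action reboot node01
```

Run it with DRY_RUN=0 in the environment once you are happy with the
printed command.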