Author Topic: Killing jobs - sometimes slow or unresponsive  (Read 10876 times)

jesse

  • Sr. Member
  • Posts: 40
Killing jobs - sometimes slow or unresponsive
« on: October 14, 2008, 04:23:31 AM »
I am not sure if this is related solely to the GUI.  I've been using Qube for almost a year now, and this issue has always bugged me.  Sometimes it seems impossible to kill a job: subjobs stay in "Running" mode even after I issue the kill command.  In some instances I've waited five minutes before logging into a machine to kill the mayabatch process myself.  That seems to be the only way I can sort it out and free up the nodes.
Is my experience normal, or could I change something to get better performance?

My worker nodes are currently win64, but I have experienced similar problems with lin32 as well.
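
In case it's useful, this is roughly what my manual workaround looks like when scripted.  The process names (mayabatch.exe on Windows, mayabatch on Linux) are just what my Maya jobs run as, so treat them as assumptions for your own setup; this isn't anything Qube provides.

[code]
# Rough sketch of the manual workaround: force-kill a stuck render process
# on the worker you've logged in to. Process names are assumptions based on
# my Maya jobs -- adjust for whatever your jobs actually run.
import platform
import subprocess

def kill_render_process(windows_name="mayabatch.exe", linux_name="mayabatch"):
    """Force-kill the render process on the local machine."""
    if platform.system() == "Windows":
        # taskkill /F forces termination, /IM matches by executable name
        cmd = ["taskkill", "/F", "/IM", windows_name]
    else:
        # pkill -9 sends SIGKILL to processes matching the name
        cmd = ["pkill", "-9", linux_name]
    subprocess.run(cmd, check=False)

if __name__ == "__main__":
    kill_render_process()
[/code]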

Thanks.

jesse

  • Sr. Member
  • Posts: 40
Re: Killing jobs - sometimes slow or unresponsive
« Reply #1 on: November 06, 2008, 10:31:52 PM »
I would still like to know: is this normal?  Should I adjust my expectations when killing jobs, or should I try to discover what is causing the problem?


jesse

  • Sr. Member
  • Posts: 40
Re: Killing jobs - sometimes slow or unresponsive
« Reply #2 on: January 12, 2009, 10:58:56 PM »
I am still really interested in this topic.

I've since upgraded to the latest versions of the Qube core and GUI, 5.4.0 and 5.4.3 respectively, and I am still having this issue.

Can someone explain the underlying sequence of events when a subjob is killed?  That would give me some ground to troubleshoot the issue on my own.

I've seen it take more than 10 minutes to kill a subjob.  On many occasions I end up logging in to the worker and killing the render process myself because Qube is too slow about it.


shinya

  • Administrator
  • Posts: 232
Re: Killing jobs - sometimes slow or unresponsive
« Reply #3 on: January 14, 2009, 06:06:49 AM »
Hi jesse,

First, please accept my apologies for letting this one go unanswered for quite some time. 

The time it takes to kill a job depends largely on how big the job is (how many subjobs and frames it has) and on the particular application being run.

When a "qbkill" (or a "kill" on the gui) is issued, a message goes to the supervisor, which in turn finds the running subjob(s) on worker(s).  The applicable workers are notified that they need to kill those subjobs assigned to them.  In turn, the workers will send a signal to the running job process.  The job process is given a grace period (default 30 seconds) to clean up after itself before it's checked up on its status, and forcefully killed if necessary. 
(It really does a lot more than that, such as updating the local worker database, and cleaning up temp log files, but let's keep it at this for now)
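
To make that grace-period step a bit more concrete, here is a rough Python sketch of the pattern.  This is an illustration, not Qube's actual worker code; it just shows "ask the process to exit, wait out the grace period, then force-kill it":

[code]
# Illustration only: the "ask nicely, wait out the grace period, then
# force-kill" escalation described above. Not Qube's actual worker code.
import subprocess

GRACE_PERIOD = 30  # seconds; the default grace period mentioned above

def kill_job_process(proc, grace=GRACE_PERIOD):
    """Terminate a job process, escalating to a hard kill if needed.

    `proc` is the subprocess.Popen handle held for the job process.
    """
    proc.terminate()              # polite request (SIGTERM on POSIX)
    try:
        proc.wait(timeout=grace)  # give it time to clean up after itself
        return True               # exited within the grace period
    except subprocess.TimeoutExpired:
        proc.kill()               # grace period expired: force it
        proc.wait()
        return False
[/code]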

When you see a message in the workerlog like "process: <JOBID.SUBID> - <PROCESSID> remove timeout: blahblah...", that's telling us the grace period expired and the job is being forcefully terminated.  Unfortunately, we have found that some application processes, especially on Windows, can take a long time to exit even when we try to terminate them forcefully.
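
If you'd like to check for that yourself before sending logs in, a quick scan of the workerlog for those "remove timeout" entries is enough.  Something like the following would do; the workerlog location varies by platform and install, so pass in the path you actually use:

[code]
# Quick helper to spot forced-kill ("remove timeout") entries in a workerlog.
# Pass the path to your workerlog; its location varies by platform/install.
import sys

def find_remove_timeouts(workerlog_path):
    """Print workerlog lines indicating the grace period expired."""
    with open(workerlog_path, "r", errors="replace") as log:
        for lineno, line in enumerate(log, 1):
            if "remove timeout" in line:
                print("%6d: %s" % (lineno, line.rstrip()))

if __name__ == "__main__":
    find_remove_timeouts(sys.argv[1])
[/code]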

If you encounter such a subjob in the future, please send us the workerlog from the worker the subjob was running on, along with the joblog folder of the job you were trying to kill, so we can take a deeper look into it.

Thanks!