Author Topic: Customize AutoWrangling?  (Read 12683 times)

instinct-vfx

  • Full Member
  • Posts: 16
Customize AutoWrangling?
« on: November 30, 2012, 02:28:00 PM »
Hi there,

Is it possible to customize the auto-wrangling behaviour? Specifically, it seems to produce a lot of false-positive locking of workers. For example, a job might have a setup problem that makes it fail: all workers on that job retry their agenda items, fail a few times, and then end up locked, even though the fault lies with the job rather than with the workers in general.

Is it possible to write custom logic (preferably in Python), or to tweak it so it does not lock workers, as a workaround?

Regards,
Thorsten

jburk

  • Administrator
  • Posts: 493
Re: Customize AutoWrangling?
« Reply #1 on: December 03, 2012, 11:43:04 PM »
When __all__ the workers fail a job, the job itself is supposed to be marked as bad. But if even _one_ of the job instances runs properly, then the supervisor assumes the failures are due to the workers.

Are you seeing this behavior where every single job instance fails and yet all the workers involved still get locked due to A-W? If so, it's a bug, and we'll fix it.

And the short answer to your question is that currently the auto-wrangling logic is built into the supervisor, and not exposed to an external module, so there's no opportunity to tweak it.

You can disable it, but I'm not sure that's what you want.
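If you do want to go that route, it's a supervisor config change rather than anything per-job. From memory, auto-wrangling is one of the entries in supervisor_flags in qb.conf, so disabling it means leaving that flag out, roughly like this (the flag names here are from memory, so verify against the qb.conf reference for your version before editing anything):

Code:
  # qb.conf on the supervisor: omit auto_wrangling from supervisor_flags to
  # turn auto-wrangling off (the entries shown are illustrative only)
  supervisor_flags = "running, remove_logs"

You'd then restart the supervisor for the change to take effect.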

instinct-vfx

  • Full Member
  • Posts: 16
Re: Customize AutoWrangling?
« Reply #2 on: December 04, 2012, 10:17:45 AM »
I see. That makes a lot of sense, actually. The description of my case was not quite to the point, I guess. The problem was that there was indeed a problem with the slaves, but it was related to the type of job. We run quite a lot of different types of jobs (ranging from Nuke and Max to a bunch of command-line tools and special in-house tools). The job causing this was a Nuke 6.3 job, and the reason was a slightly messed-up 6.3 setup on those workers. We usually maintain and handle slaves in blocks of ten, so it was one block of ten that got locked while two others went on.

The big problem is that these workers would have worked perfectly fine for all other jobtypes. If this happens late on a Friday evening, a block of ten might sit there locked and wasting power, because one jobtype is fubared rather than the block of slaves itself.

Hence the wish to be able to tweak it, or at least prevent the worker locking. If I were to customize it, my ideal solution would be something like removing a specific job resource from the worker, so that it no longer joins jobs of the same type but is still available for other jobs.
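To make that concrete, expressed in qbwrk.conf terms it could be something along these lines (the host block and resource name are invented for the example; this is just the idea, not a tested setup):

Code:
  # qbwrk.conf: a Nuke-capable block advertises a site-defined resource
  # (all names here are made up for illustration)
  [workername]
  worker_resources = "host.nuke63=1"

Nuke jobs would then reserve host.nuke63=1, and pulling that resource from a block would keep it out of Nuke jobs while leaving it available for every other jobtype.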

Regards,
Thorsten

jburk

  • Administrator
  • Posts: 493
Re: Customize AutoWrangling?
« Reply #3 on: December 04, 2012, 03:21:08 PM »
Hmm... Speaking hypothetically, I guess it might be good to be able to dynamically manipulate a worker's ability to run jobs of a certain type, but I'm thinking you'd have to use the jobtype property to differentiate between jobs.

I need to test whether the worker_jobtypes parameter can be used to mask out an installed jobtype for a given worker, but even if we did this, it would be un-obvious, bordering on magic, why a worker was suddenly refusing only certain jobs. Users would only know why if they were on the auto-wrangling mailing list, whereas a locked machine is pretty obvious.
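The masking I have in mind would look roughly like this in the worker config (untested, which is exactly what I'd need to verify; the host block is a placeholder):

Code:
  # qbwrk.conf entry for one worker/block: advertise only the listed jobtypes,
  # leaving the broken one (nuke, in this case) out
  [workername]
  worker_jobtypes = "cmdrange"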

Were the nuke jobs cmdline jobs?

instinct-vfx

  • Full Member
  • Posts: 16
Re: Customize AutoWrangling?
« Reply #4 on: December 05, 2012, 01:14:29 PM »
I am using the Python API to construct a SimpleCmd job from Python and submit that; the jobs then turn up as cmdrange jobs. We are still investigating which of our job types we want to wrap in custom jobtypes, to get additional functionality without a mile-long command line and no insight whatsoever, heh.
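Roughly like this, for reference: a stripped-down sketch against the standard qb module, using a plain job dict rather than the SimpleCmd wrapper, with the command line, frame range and cpu count just placeholders:

Code:
  import qb

  # minimal cmdrange-style submission; QB_FRAME_NUMBER is substituted per agenda item
  job = {
      'name': 'nuke comp render',
      'prototype': 'cmdrange',
      'cpus': 10,
      'package': {'cmdline': 'nuke -x -F QB_FRAME_NUMBER /path/to/comp.nk'},
      'agenda': qb.genframes('1001-1600'),  # one work item per frame
  }

  submitted = qb.submit([job])
  print(submitted[0]['id'])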

Regards,
Thorsten

mim

  • Full Member
  • Posts: 14
Re: Customize AutoWrangling?
« Reply #5 on: December 18, 2014, 10:06:19 AM »
Quote from: jburk
When __all__ the workers fail a job, the job itself is supposed to be marked as bad. But if even _one_ of the job instances runs properly, then the supervisor assumes the failures are due to the workers.

If I understand this correctly: at the moment, if a frame sequence (e.g. 1001-1600) starts out fine (frames 1001-1200 are completed) and the rest of the sequence then starts failing, the supervisor will block the workers and not the job?

In that case, would it be advisable to increase the "aw_job_migrate_max" setting so that frames >1200 get to fail on all machines, and the supervisor realizes that the job is the problem, not the workers?
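i.e. bumping something like this in qb.conf on the supervisor (the value is only an example):

Code:
  # qb.conf on the supervisor: let failing frames migrate to more hosts before
  # auto-wrangling draws its conclusion (example value)
  aw_job_migrate_max = 10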

regards
mikko