Hi,
I'm running into situations where I have a large number of tasks, and the tasks complete so quickly that the first "few" workers finish most of them. As a result, a large number of workers are never assigned work, and the job is reported as 100% complete but with a status of pending/running. When I look further into it, a large number of workers are still pending, and it seems to take a great deal of time to cycle through these workers, which are seemingly doing nothing yet are still using up CPUs.
This is causing problems for two reasons. First, the empty workers use up CPU time, preventing those CPUs from being used elsewhere. Second, it prevents a finished task from actually being marked as completed, which in turn prevents any task that depends on it from starting.
I am currently evaluating Qube, so I have only a fraction of the farm to test with. However, this kind of situation could also appear when the farm is under heavy load and only a handful of CPUs are available for use.
Thanks,
Zameer