Bear with me while I try to explain the behavior we're seeing.
Qube 5.5.1, Maya 2009, running on Red Hat Enterprise Linux WS 4.
Render nodes read scene data from the local drive and then write frames to a SAN drive array mounted via MetaLAN 4.2.1.
Rendering was stable and reliable when our output files were IFF (about 2MB each). We recently switched to EXR format (about 24MB each) and the behavior has become very erratic: renders run fine for a while (an hour, perhaps two) and then the jobs sit with no progress.
The MetaLAN-mounted directories show a small (128-byte) file for each stalled render, which I assume is part of the EXR header. The ownership on those files should be root:root, but sometimes it is qubeproxy:qubeproxy. The MetaLAN-mounted device appears to be stable and accessible, but this error typically appears in one or more of the stderr logs:
INFO: testing output directory for [images]
WARN: output directory [/mnt/SAN0/sg_mayaRender/tests/test06/chunk_0009_shotCam4_1K185/] does not exist... attempting to create it...
ERROR: cannot create output dir [/mnt/SAN0/sg_mayaRender/tests/test06/chunk_0009_shotCam4_1K185/]
ERROR: SUPER::initialize() at /usr/local/pfx/jobtypes/maya/UniversalMayaRenderJob.pm line 112.
ERROR: in initializing job at /usr/local/pfx/jobtypes/maya/MayaJob.pm line 206.
INFO: reporting status [failed] to supe: qb::reportjob('failed')
maya::MelProcessor::DESTROY
maya::MelProcessor::finish
INFO: HARNESS=[IPC::Run=HASH(0x1114e00)]
INFO: exiting from maya
quit -f
That directory clearly exists and already contains many correctly written output frames. For whatever reason the directory appears momentarily inaccessible, which causes one render to fail, and that seems to set off a domino effect across the other three render nodes: they sit stalled indefinitely until we intervene manually.
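In case it helps anyone reproduce this, here is roughly the kind of standalone probe I'm planning to leave running on a worker while renders are going: it just re-checks that the output directory exists and accepts a small test write, and timestamps any transient failure. The loop interval and temp-file naming are my own choices for the test, not anything taken from the Qube jobtype.

#!/usr/bin/env python
# Probe sketch: repeatedly verify that the render output directory on the
# MetaLAN mount exists and can actually accept a small write, and log any
# transient failure with a timestamp. Directory below is the one from the
# stderr log; substitute whatever path your job writes to.

import os
import time
import tempfile

OUTPUT_DIR = "/mnt/SAN0/sg_mayaRender/tests/test06/chunk_0009_shotCam4_1K185/"

def probe(path):
    """Return None on success, or a short description of what failed."""
    if not os.path.isdir(path):
        return "os.path.isdir() says the directory is missing"
    if not os.access(path, os.W_OK):
        return "os.access() says the directory is not writable"
    try:
        # Do an actual small write, since access() can lie on network mounts.
        fd, tmp = tempfile.mkstemp(prefix=".probe_", dir=path)
        os.write(fd, b"x")
        os.close(fd)
        os.unlink(tmp)
    except OSError as err:
        return "test write failed: %s" % err
    return None

while True:  # Ctrl-C to stop
    failure = probe(OUTPUT_DIR)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    if failure:
        print("%s  PROBE FAILED: %s" % (stamp, failure))
    else:
        print("%s  ok" % stamp)
    time.sleep(10)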
I realize this is a complex environment, but my feeling is that the MetaLAN mount is the likely culprit, perhaps something to do with the larger file size (24MB+) or the timing of the I/O. I am going to take that piece out of the puzzle and see whether writing locally improves reliability.
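For that isolation test I'll probably start with something as simple as the sketch below: write a couple hundred ~24MB files (roughly one EXR frame each) first to local scratch and then to the MetaLAN mount, fsync each one, and compare timings. The paths and frame count are placeholders, not our actual job setup.

#!/usr/bin/env python
# Isolation test sketch: time ~24MB writes to local disk vs. the MetaLAN
# mount. If the mount is the problem, the SAN run should stall or show
# wildly inconsistent timings. Both target paths are placeholders.

import os
import time

FRAME_SIZE = 24 * 1024 * 1024   # ~24MB, about one EXR frame
FRAME_COUNT = 200               # arbitrary; long enough to hit the stall window
TARGETS = {
    "local": "/var/tmp/write_test",   # local scratch (placeholder)
    "san":   "/mnt/SAN0/write_test",  # MetaLAN-mounted SAN (placeholder)
}

payload = os.urandom(FRAME_SIZE)

for label, directory in TARGETS.items():
    if not os.path.isdir(directory):
        os.makedirs(directory)
    for i in range(FRAME_COUNT):
        path = os.path.join(directory, "frame_%04d.bin" % i)
        start = time.time()
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())   # force the data out to the mount
        elapsed = time.time() - start
        print("%-5s frame %04d  %.2fs" % (label, i, elapsed))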
Sometimes we can kill/retry the stalled jobs and processing continues fine. Other times we need to restart the worker, and occasionally it takes a full shutdown/reboot to get rendering working again.
Any other thoughts or ideas would be appreciated. Thanks.
Steve