Topic: tcp/udp socket timed out (Read 16385 times)

sosborne76 · « **on:** November 15, 2011, 11:48:14 AM »

I have a render farm that generally works well. However, it occasionally runs into trouble when the queue is busy: many jobs are queued, some processing and others pending, and a new job arrives and starts processing on recently freed processors. When this happens I see a series of error messages like:

ERROR: unable to establish tcp connection with 10.0.180.1 - unable to connect to host.
ERROR: udp receive socket timed out.
[Nov 14, 2011 18:03:12] VENTUROUS[1812]: WARNING: work report attempt[58] failed; resending.
[Nov 14, 2011 18:03:12] VENTUROUS[1812]: WARNING: sleeping [5] seconds before retry.
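During a busy period it can help to tell a refused connection apart from one that simply gets no answer. Below is a minimal Python sketch that times a single TCP connect from a worker to the supervisor. The supervisor's TCP port is not given in this thread, so `port` is a parameter you would fill in from your own Qube configuration; `probe_tcp` is a hypothetical helper, not part of Qube.

```python
import socket
import time
from typing import Optional, Tuple


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> Tuple[bool, float, Optional[str]]:
    """Attempt one TCP connection; return (succeeded, seconds taken, error text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError if the connection is refused or the
        # host is unreachable, and socket.timeout (an OSError subclass) if no
        # answer arrives within `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connected; close immediately
    except OSError as exc:
        return False, time.monotonic() - start, f"{type(exc).__name__}: {exc}"
    return True, time.monotonic() - start, None
```

Calling `probe_tcp("10.0.180.1", port)` every few seconds while the queue is busy, and logging the error text, shows whether failures are refusals (the supervisor answering "no") or timeouts (no answer at all).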

I am not sure whether the specific conditions we experienced are relevant or whether this is simply expected when dealing with large queues. Either way, I suspect it is a symptom of how Windows 2008 R2 is configured and that 'port exhaustion' may be occurring. Has anyone else come across this issue? Do you have any advice on tweaking the configuration or registry so it can handle the load a busy queue places on the TCP sockets? Should I make these changes on the workers and the supervisor, or on the supervisor only?
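Before changing registry settings, it may be worth checking whether ports really are running out: a machine short of ephemeral ports typically shows a very large number of sockets in TIME_WAIT. A minimal Python sketch that tallies TCP connection states from Windows `netstat -an` output; it assumes the four-column Windows format (Proto, Local Address, Foreign Address, State), and Linux `netstat` output, which has more columns, will not match.

```python
from collections import Counter


def count_tcp_states(netstat_text: str) -> Counter:
    """Count TCP connection states in Windows `netstat -an` output."""
    states: Counter = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Windows TCP rows look like: TCP <local addr> <foreign addr> <state>
        # Header, blank, and UDP rows have a different shape and are skipped.
        if len(fields) == 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states
```

On a worker or the supervisor, the input can come from `subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout`; comparing the TIME_WAIT count during a busy period with the size of the ephemeral port range shows how close the machine is to running out.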

Our render farm has 20 workers and runs Qube 6.2.1 on Windows 2008 R2 SP1.
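A rough way to judge whether port exhaustion is plausible is to divide the number of ephemeral ports by the time each closed connection spends in TIME_WAIT, which gives a ceiling on how many new outbound connections per second one machine can sustain. The Python sketch below assumes the Windows Server 2008 R2 default dynamic port range of 49152 to 65535 and a TIME_WAIT interval of 240 seconds; the 240-second figure is an assumption to check against your own systems, not a value taken from this thread.

```python
def ephemeral_port_count(start: int, end: int) -> int:
    """Number of ports in an inclusive ephemeral port range."""
    return end - start + 1


def max_new_connections_per_second(ports: int, time_wait_s: float) -> float:
    """Rough ceiling on sustained new outbound connections from one machine,
    assuming every closed connection holds its local port in TIME_WAIT for
    the whole interval and no port is reused sooner."""
    return ports / time_wait_s


# Assumed values; verify them on your own machines.
ports = ephemeral_port_count(49152, 65535)          # 2008 R2 default dynamic range
rate = max_new_connections_per_second(ports, 240)   # assumed 240 s TIME_WAIT
print(f"{ports} ports -> about {rate:.0f} new connections/s sustained")
```

The budget applies separately to each machine that opens connections, because ephemeral ports are consumed by the side that initiates a TCP connection, not the side that accepts it. That is one way to reason about whether a tweak belongs on the workers, the supervisor, or both.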

dmeyer · « **Reply #1 on:** December 20, 2011, 02:20:45 PM »

Are your supervisor and filer on the same machine?

sosborne76 · « **Reply #2 on:** January 20, 2012, 02:22:08 PM »

I assume that by 'filer' you mean a server that handles file transfers in this process. To be honest, I only have a supervisor, and it handles everything. I was not aware that Qube had another server role, how it works, or how to configure it. Please enlighten me and explain how it would help, as this has been an on-and-off problem for a long time.

In an attempt to fix it myself, I created the following registry value

[HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters]
MaxUserPort = 20000

on my Win 2008 R2 SP1 supervisor yesterday. I don't think it has helped.
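To confirm that a registry change landed as intended, the value can be read back. A minimal Python sketch using the standard-library `winreg` module; `read_tcpip_dword` is a hypothetical helper name. It returns None on non-Windows machines, when the value is absent, or when the value was not stored as a DWORD (MaxUserPort is a REG_DWORD, so a string value would indicate a mistake). Printing the value in decimal also makes it obvious if 20000 was typed in regedit's default hexadecimal base, which would store 131072. Changes under Tcpip\Parameters usually take effect only after a reboot, and on Windows Server 2008 R2 the active ephemeral range can also be listed with `netsh int ipv4 show dynamicport tcp`.

```python
import sys
from typing import Optional

TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def read_tcpip_dword(name: str) -> Optional[int]:
    """Return a DWORD value from the Tcpip\\Parameters key, or None if it is
    absent, has the wrong type, or this is not a Windows machine."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard-library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
            value, value_type = winreg.QueryValueEx(key, name)
    except FileNotFoundError:  # key or value does not exist
        return None
    return value if value_type == winreg.REG_DWORD else None
```

For example, `print(read_tcpip_dword("MaxUserPort"))` on the supervisor should print 20000 if the value was entered as intended.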

dmeyer · « **Reply #3 on:** January 20, 2012, 02:25:03 PM »

Yes, 'filer' is shorthand for file server.

Where are the files that your farm workers access for rendering (scene files, textures, etc.)?

Are they on the same machine as the supervisor?

sosborne76 · « **Reply #4 on:** January 20, 2012, 03:46:48 PM »

No, they are not. I have a NAS server where users put their projects and where the final images are written. The shared Qube logs are also stored on a separate share on the same NAS server.

In the past I have found that these UDP messages mostly relate to the supervisor, with warnings from the workers mixed in among them. More recently, converseSubSupervisor() warnings have started appearing as well:

ERROR: max connections reached for 10.0.180.1 - connection rejected.
ERROR: unable to receive from 10.0.180.1 - tcp receive socket timed out.
[Jan 20, 2012 12:07:29] VENTUROUS[2780]: WARNING: converseSubSupervisor() attempt[1] failed.
[Jan 20, 2012 12:07:29] VENTUROUS[2780]: WARNING:: retrying in [7] seconds.
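The log lines show retries at fixed intervals ([5] and [7] seconds). If the supervisor is hitting a connection limit, clients that all failed at the same moment and all wait the same fixed time can come back at the same moment too. A common general mitigation in client code is exponential backoff with random jitter. The Python sketch below illustrates that technique only; it is not Qube's retry logic, and `attempt` stands for any hypothetical connection attempt that returns True on success.

```python
import random
import time
from typing import Callable


def retry_with_backoff(
    attempt: Callable[[], bool],
    max_attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call attempt() until it returns True or max_attempts is used up.

    After failure n (counted from 0), wait a random time in
    [0, min(max_delay, base_delay * 2**n)] ("full jitter"), so clients that
    failed together spread their retries out instead of returning in lockstep.
    """
    for n in range(max_attempts):
        if attempt():
            return True
        if n < max_attempts - 1:  # no point sleeping after the final attempt
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** n)))
    return False
```

The `sleep` parameter is injectable so the delays can be inspected or skipped in tests; in real use the default `time.sleep` applies.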
