Scaling relays, Linux vs Windows

atlauren · March 28, 2018, 6:36pm

Regarding relays, the current guidance is 1000:1 clients:relay; the roadmap at Think 2018 indicates we’ll see updates for up to 5000:1.

For those deploying relays on different platforms, do you see different practical limits on the effective workload?

Our core server is Windows, but most of our clients connect to CentOS7-based relays. We’re finding that the 1000:1 ratio is a firm limit on Linux; when relays approach that limit, “weird” things start happening, such as a machine appearing offline in the console when it’s known to be online. Often sending a refresh will make it reappear, but sometimes not.

Is this limit in fact constrainted by the root user’s ulimit?
http://www-01.ibm.com/support/docview.wss?uid=swg22006914

I’m reminded of another product I administered on Solaris and later Linux, which gave guidance on kernel parameters for a given sizing scenarios. I seem to recall that were thinks like open files and memory handlers, often scaling well beyond what was normal for a process running in userspace. Perhaps IBM should do something similar for dedicated relays.

jgstew · March 28, 2018, 6:44pm

There are some guides for this: http://www-01.ibm.com/support/docview.wss?uid=swg27048326

but there should be more bigfix content to do some of the tuning for dedicated relays provided.

There is potential to make it scale better by tuning OS limits on:

ephemeral ports
open sockets
TCP wait time
open file limits
and more

AlanM · March 28, 2018, 6:47pm

from the capacity planning guide (which isn’t directly applicable but is related):

Linux ulimit Management
For Linux operating system defines a system ulimit for the maximum number of open files allowed for a process (i.e. the nofiles option when you run the command “ulimit –a”). For the DB2 instance, the value for this kernel limit should be either “unlimited” or “65536”.

The relay does keep a lot of file handles open for reports, gathering, serving files etc.