Increase Relay Capacity

We all know that IBM recommends only having 500-1000 clients per relay. Does anyone go beyond that while trying to maintain the same client performance? Say, 2000+ clients per relay?

It seems that the bottleneck at the relay level isn't resources but available network sockets. For Windows, the default TIME_WAIT is 240 seconds, and for Linux it is 60 seconds. The number of ephemeral ports available also differs between Windows and Linux, and that count factors into how many simultaneous inbound connections can be handled.

So in theory, decreasing TIME_WAIT and increasing the number of ephemeral ports should allow a relay to accept more clients, correct?
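
For reference, these are the OS knobs involved. This is a rough sketch only - the values are illustrative, not recommendations, and you would want to test before rolling anything out. On newer Windows the ephemeral range is adjusted with netsh rather than MaxUserPort, and on Linux the 60-second TIME_WAIT is compiled into the kernel, so the practical levers there are the port range and reuse flags:

    win:
      reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f
      reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v MaxUserPort /t REG_DWORD /d 65534 /f
      rem reboot afterwards for the changes to take effect
    Linux:
      # widen the ephemeral port range and allow reuse of TIME_WAIT sockets for outbound connections
      sysctl -w net.ipv4.ip_local_port_range="10240 65535"
      sysctl -w net.ipv4.tcp_tw_reuse=1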


With the switch to 64-bit relay code in v9.2, do you think that will have any impact on performance?

I wouldn't assume so. The BESRelay service being 64-bit doesn't, I think, change its interaction with the TCP/IP stack.
But I could be wrong.

What I’m looking for is a way to detect via monitoring when a Relay’s available sockets have been exhausted, and clients aren’t able to connect because of it.
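
One blunt option in the meantime is to sample the OS's own TCP counters on the relay and watch for a sustained climb alongside client connection failures. For example, on a Windows relay (counter name as exposed by Performance Monitor):

    rem sample established TCP connections every 15 seconds, 20 samples
    typeperf "\TCPv4\Connections Established" -si 15 -sc 20

It won't flag exhaustion directly, but graphed over time it gives you a baseline to alert against.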

Though not answering your question directly, cstoneba…
I have relays which hover around 1500-2000 clients, depending on the given day of the week and the alignment of the stars. Performance is generally in an acceptable range of 10 minutes or less; slowdowns or delays are usually down to network conditions, the distance between the client and its relay, client responsiveness, and how much BigFix-related work you are asking the client to do (actions, analyses, etc.).

Also, in our case, while it may show 1500 clients reporting in, in reality only about 75% (+/-) of those clients are active.
At one time I did, unintentionally, have a relay with 10,000 clients reporting to it (oops, that was fixed).

I also have an F5 load balancer configured in front of 4 of our relays. Clients that always want to latch on to the main server are told, via a fixlet, to go away and talk to this failover load-balanced name instead.

Not sure whether you're trying to solve a specific problem or just do more with less, but I hope that helps.


It is true that the main limitation on relays, assuming decent hardware, is sockets and ephemeral ports.

We currently have a relay that typically has around 3000 clients registered to it. There might be some performance issues, but none that we are specifically aware of. It would be helpful to have an idea of what the actual impact is, and a good way to measure the effects of having over 2000 clients on a relay.

We have been meaning to decrease TIME_WAIT and apply other OS tweaks to help with the performance of a high number of clients on a single relay, but it would be nice if there were more documentation on this, and some tasks to help automate it.

I've had multiple relays with over 10K clients reporting in (about 10,700 at its peak). To start with, I had major concerns about implementing this, so I conducted tests to verify the relay could handle the capacity. I also made setting changes to ease the load on the relay. The only functionality I was fulfilling was general IEM maintenance, collecting initial software scan data, and then collecting capacity data every 30 minutes - so in essence I was delivering and receiving quite a bit of data. Clients were set up with manual relay selection with a primary, a secondary, and a failover relay.

The relays in question were dedicated top-level relays (approx. 4 cores, 8 GB RAM, on Red Hat VMs).

To try and minimize the impact on the relay I changed:

_BESClient_Report_MinimumInterval from 15 to 60 seconds

Changed Client Heartbeats from 15 mins to 30 mins

Ensured I had no analysis evaluating more often than every 30 mins (i.e. none set to evaluate on every report)

And I also increased the retry interval clients use during relay selection when they get a refused connection. (A rough sketch of pushing the report-interval setting follows below.)
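
For what it's worth, settings like the report interval are normally pushed from the console or a fixlet (the heartbeat is set from the console preferences rather than as a client setting). As a rough sketch of the Windows-side equivalent - path shown for a 32-bit client on 64-bit Windows, so adjust for your install:

    rem illustrative only: raise the minimum report interval to 60 seconds, then restart the client
    reg add "HKLM\SOFTWARE\Wow6432Node\BigFix\EnterpriseClient\Settings\Client\_BESClient_Report_MinimumInterval" /v value /t REG_SZ /d 60 /f
    net stop BESClient
    net start BESClient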

The observed behavior of the relays was actually very normal. The relays were not under any particular load - to be honest, it wasn't even touching the sides in terms of CPU and RAM. Disk I/O and network traffic were the most noticeable aspects, although they still weren't causing excess load.

However, what you need to avoid:

If you're sending out an action to all the clients that report to the relay, stagger it over time. As you said, the actual limitation is the number of simultaneous connections, so you want to avoid having all the clients trying to report back at the same time. If you stagger actions, the clients will gradually report back in an orderly fashion and you won't hit the max connections limit. This way a relay can easily service a large number of clients.

Otherwise, what you may observe is clients responding slightly more slowly when you deliver actions - due to the relay being busy servicing its maximum number of connections before moving on to the next bunch - or clients failing over to another relay.

A good place to check the load on the relay is the relay diagnostics page (http://RelayHost:52311/rd), paying particular attention to FillDB:

If this goes over 100%, the relay is quite busy; however, it should still eventually get through all the reports.
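
If you'd rather not keep a browser open, you can also poll the page from your monitoring box - assuming the diagnostics page is reachable from there, and bearing in mind the exact text of the FillDB section may differ between versions:

    # pull the diagnostics page and pick out the FillDB lines
    curl -s http://RelayHost:52311/rd | grep -i filldb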

The only time I had an issue was when a relay in a lower-bandwidth area couldn't process and pass on the Upload Manager data fast enough: data was coming in faster than the relay could pass it on to the IEM server. Basically, the network connection was too slow, so the data wasn't going up fast enough. This meant that the Buffer Directory Max Count and Buffer Directory Max Size kept getting bigger and bigger. However, even when both were 20x the default size, FillDB was still working OK, and the relay was responsive enough to let me change the settings for those clients (increasing the capacity scanning interval to 2 hours instead of 30 minutes).

In terms of maintaining performance, I would suggest that a relay can easily handle upwards of 2-3K clients without changing the out-of-the-box settings. Changing the heartbeat, even to 20 minutes, will help you though, as it gives the relay more time to process reports.

If you really want to push the limits and get faster performance, then you can look at adjusting the post-results size and count:

_BESRelay_PostResults_ResultCountLimit

_BESRelay_PostResults_ResultTimeLimit

Decreasing both values will improve response time but increase network traffic.

And of course you can increase the max connections, though how far depends on the limits of the particular OS.
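
Like the client setting above, these relay settings are applied as client settings on the relay's own BES Client. Purely as a sketch - the numbers below are placeholders, not defaults or recommendations:

    rem illustrative placeholder values - tune and test for your environment
    reg add "HKLM\SOFTWARE\Wow6432Node\BigFix\EnterpriseClient\Settings\Client\_BESRelay_PostResults_ResultCountLimit" /v value /t REG_SZ /d 100 /f
    reg add "HKLM\SOFTWARE\Wow6432Node\BigFix\EnterpriseClient\Settings\Client\_BESRelay_PostResults_ResultTimeLimit" /v value /t REG_SZ /d 5 /f
    rem restart the relay service so it picks up the new values
    net stop BESRelay
    net start BESRelay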


Thanks for the info. My goal is just to have a relay support more than the default 1000 clients, as we are starting a large client deployment and the fewer new relays needed, the better.

I'm finding it challenging to detect port exhaustion on relays because the clients create a connection and it then closes so quickly. Here are some commands I found that give the current connection counts, but they spike constantly, so the data isn't hugely useful:

win: netstat | find /i "tcp" | find /v /c "127.0.0.1"
Linux: netstat -an | grep -i "tcp" | grep -v "*" | wc -l

Even on a Windows relay with 1094 clients, the netstat result shows roughly 50-80 connections, with all clients on the default 15-minute heartbeat. That makes me think this particular relay can handle many more clients than its current load.
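
A slightly less noisy signal might be to count only the sockets sitting in TIME_WAIT around the BES port, since those are what actually tie up the socket budget between reports - again sampled repeatedly rather than trusted as a single reading:

win: netstat -an | find "52311" | find /c "TIME_WAIT"
Linux: netstat -ant | grep ":52311" | grep -c "TIME_WAIT"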


As long as you don't have slow links or slow disks, I would not be concerned until you go over 2000 clients per relay.

I think the more conservative number of clients per relay (100 to 1000) is more of a concern if you are using random desktops as relays, or if the relay server is doing other things besides being a relay.

The main thing this would help with is the socket/port/connection issues, and you could run multiple relay VMs on a single host. One of the main issues with that approach is that the relay cache would not be shared, as it would be if the host acted as a single relay for the combined clients of those VMs. It would be interesting if there were an easy way for multiple relays to share a cache in this configuration.

@jgstew - I don't think it would be too hard to set up a shared cache: place wwwrootbes on a shared directory and then create the appropriate symlinks on your relays.
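
As a rough sketch on a Linux relay - assuming the default install path of /var/opt/BESRelay (service name and paths may differ on your version), and with the shared mount point below being entirely hypothetical:

    # stop the relay, move its download cache onto shared storage, and symlink it back into place
    service besrelay stop
    mv /var/opt/BESRelay/wwwrootbes/bfmirror /mnt/shared/bfmirror
    ln -s /mnt/shared/bfmirror /var/opt/BESRelay/wwwrootbes/bfmirror
    service besrelay start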

As for your comment on the relay numbers, I completely agree. The estimate of a maximum of 1000 clients is conservative, based on the assumption that you could be running on the minimum required hardware (Pentium III, 512 MB RAM, etc.), which could itself be a shared computer.


I'll try to remember to post a task for decreasing TIME_WAIT and increasing ephemeral ports to bigfix.me tomorrow. I already have a fixlet that does this. Microsoft has a document on it, in the context of increasing capacity for Exchange Server.


Symlinks are an option I was thinking about for a shared relay cache. I still wonder whether it would be a good idea and how well it would work, especially when it comes to potential collisions when multiple relays sharing the cache pull the same new download at the same time, or when items are deleted as the cache rolls over.

Sorry it took a while, but I just posted my TIME_WAIT fixlet to bigfix.me: https://bigfix.me/fixlet/details/3937

…it's a simple registry edit, easily removed if necessary. I've had no ill effects and have used this on my servers for years.


We run multiple relay VMs on a single host. This provides redundancy in the relay infrastructure in case one or more goes down due to a failure, or when you are patching your relays in phases, one at a time.

Each VM has a dedicated SSD and a dedicated 1 Gb NIC port.

The server has 16 disks, 3 quad-port NICs, and 2 onboard NICs.


I received a PM on this, so in case anyone's still looking for it, the fixlet is at https://bigfix.me/fixlet/details/3937
