Relays not switching to secondary when primary top-level relay is offline revisited

MB43 · February 2, 2022, 1:01pm

A while back I posted about an issue I was having with our relays not switching over to the secondary/tertiary top-level relay when the primary is not available. I thought I had it working properly, but after one of our top-level relays went offline this week I see that it still isn’t.

We have one top-level relay in each region: North America, Europe and Asia. There are second-level relays scattered throughout those regions; each is configured for manual relay selection and has the top-level relay in its region as the primary and one in the next region as the secondary. Also, both _BESClient_RelaySelect_FailoverRelayList and _BESClient_RelaySelect_TertiaryRelayList are set on each second-level relay as “localregion;secondregion;thirdregion” (with the actual top-level relay names, obviously).
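To make the layout concrete, here is a minimal Python sketch of how the “localregion;secondregion;thirdregion” value could be built for each region. The hostnames and the `failover_list` helper are hypothetical placeholders, not our real relay names or anything provided by BigFix, and the rotation order (local region first, then the next regions in turn) is an assumption based on the description above.

```python
# Hypothetical top-level relay hostnames, one per region (placeholders only).
TOP_LEVEL = {
    "NorthAmerica": "na-top.example.com",
    "Europe": "eu-top.example.com",
    "Asia": "asia-top.example.com",
}


def failover_list(local_region, regions=TOP_LEVEL):
    """Return 'local;next;third' with the local region's relay first."""
    names = list(regions)
    start = names.index(local_region)
    # Rotate the region order so the local region leads the list.
    ordered = names[start:] + names[:start]
    return ";".join(regions[name] for name in ordered)


# Print the semicolon-delimited value each region's second-level relays would get.
for region in TOP_LEVEL:
    print(region, "->", failover_list(region))
```

The resulting string is what would go into both list settings on a second-level relay in that region.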

The top-level relays have _BESRelay_Register_Affiliation_AdvertisementList set to “TopLevel”; none of the clients have this in their _BESClient_Register_Affiliation_SeekList, so none would connect to them.

This part works properly: each relay within the region connects to the top-level relay within that region as the primary, and the clients connect to the closest second-level relay within that region.

But when that primary top-level relay is offline, the second-level relays should switch over to one of the others, and that’s not happening. To test this, yesterday I took the Europe top-level relay offline. The second-level relays in that region kept trying to connect to it over the next 4 hours and never switched over to another one.

When the primary top-level relay is unavailable I’d like to see each second-level relay switch over to the secondary within 30 - 60 minutes. What exactly do I need to do to make this happen reliably?

MB43 · February 11, 2022, 7:32pm

Nobody has any information about this?

JasonWalker · February 11, 2022, 9:11pm

I’d perform a connectivity test to ensure that each second-level relay can actually reach the secondary, both with ‘ping’ and by downloading https://relay:52311/masthead/masthead.afxm from the secondary relays.
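As a rough illustration of that check, here is a minimal Python sketch that could be run on a second-level relay. It tests a plain TCP connection to port 52311 and then tries the masthead download. The function names are my own, the hostname you pass in is whatever your secondary relay is called, and the `verify_tls=False` option is only an assumption for cases where the relay’s certificate isn’t trusted by the local OS store; it is not a BigFix tool.

```python
import socket
import ssl
import urllib.request

RELAY_PORT = 52311  # port taken from the masthead URL above


def masthead_url(relay_host, port=RELAY_PORT):
    """Build the masthead URL for a given relay hostname."""
    return f"https://{relay_host}:{port}/masthead/masthead.afxm"


def tcp_reachable(host, port=RELAY_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def fetch_masthead(relay_host, timeout=10.0, verify_tls=True):
    """Download the masthead; return (HTTP status, number of bytes read)."""
    context = ssl.create_default_context()
    if not verify_tls:
        # Assumption: skip certificate checks only for a quick diagnostic.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    url = masthead_url(relay_host)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.status, len(resp.read())
```

For example, calling `tcp_reachable("your-secondary-relay")` followed by `fetch_masthead("your-secondary-relay")` from each second-level relay would show whether the port is open and whether the download itself succeeds.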

Next, check the client settings page at https://help.hcltechsw.com/bigfix/10.0/platform/Platform/Config/r_client_set.html#r_client_set__regs and see whether you have any of these set on your child relays, and what their values are:

_BESClient_RelaySelect_IntervalSeconds
_BESClient_RelaySelect_ResistFailureIntervalSeconds

The only way I’m aware of to force a Relay to fail over more quickly is to reduce the _BESClient_RelaySelect_IntervalSeconds value. By default it’s set to six hours. If the upstream Relay is no longer reachable, the local relay will continue caching and buffering reports from its own client and from the clients that report to it. On a heavily-used relay, if the buffer dir fills, it reports that error to its own client, which will trigger a new relay selection. But if the relay is not heavily used, or the max buffer dir size is set high, the relay would only perform a new selection when _BESClient_RelaySelect_IntervalSeconds expires.
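To put numbers on that, here is a small Python sketch of the arithmetic. It assumes the six-hour default is in effect on the child relays (I don’t know what yours are actually set to) and it deliberately ignores the buffer-full trigger and _BESClient_RelaySelect_ResistFailureIntervalSeconds; the helper names are just for illustration.

```python
# Six-hour default for _BESClient_RelaySelect_IntervalSeconds, as described above.
DEFAULT_INTERVAL_SECONDS = 6 * 60 * 60


def minutes_to_seconds(minutes):
    """Convert a target failover window in minutes to a settings value in seconds."""
    return int(minutes * 60)


def would_have_reselected(observed_hours, interval_seconds=DEFAULT_INTERVAL_SECONDS):
    """True if the observation window is at least as long as the selection interval."""
    return observed_hours * 3600 >= interval_seconds


print(DEFAULT_INTERVAL_SECONDS)                         # 21600
print(would_have_reselected(4))                         # False
print(minutes_to_seconds(30), minutes_to_seconds(60))   # 1800 3600
```

Under those assumptions, a 4-hour test is shorter than the 21600-second interval, so seeing no failover in that window is what I’d expect. A 30 to 60 minute target would correspond to values between 1800 and 3600.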

There is a definite performance overhead to a Relay failing over quickly: the client registration list for all of its downstream clients is updated on all of its upstream Relays and on the root server itself. In most cases 30 minutes may be too short a failover interval, especially these days when a Windows patch may take longer than that during a normal reboot. But if you want to tune a setting and observe results, _BESClient_RelaySelect_IntervalSeconds is the one to work with on the child relays.