Relays not switching to secondary when primary top-level relay is offline

I have three top-level relays, one in each region (NALA, APAC, EMEA). Each has its Affiliation Advertisement list set to “TopLevel” so that no clients will connect to them.

The second-level relays are manually configured to use the top-level relay in their region as the primary, and one in another region as the secondary. None of the second-level relays have an Affiliation Seek list defined since they’re using manual relay selection.

Today the APAC top-level relay went offline, but none of the second-level relays have switched over to the top-level relay defined as their secondary. I thought they would switch after 10 minutes, but after nearly an hour I see them still trying to connect to the primary in the APAC region.

What am I missing in this configuration? Is there something else that needs to be configured on the second-level relays in order for them to switch over to a different top-level relay?

Can the second-level relays reach the remaining top-level relays on both ICMP and tcp/52311? In my experience, even with manual relay selection, the child attempts to ping a relay before switching over to it.
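A quick way to test both paths from a second-level relay is a small script like the one below. This is only a sketch: the hostnames in the usage comment are placeholders, and the check relies on the system `ping` command's exit code, which may not reflect exactly what the BigFix client sees.

```python
import platform
import socket
import subprocess


def tcp_reachable(host, port=52311, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def ping_command(host):
    """Build a one-packet ping command for the current platform."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", count_flag, "1", host]


def icmp_reachable(host, timeout=5.0):
    """Return True if the system ping command exits with code 0."""
    try:
        result = subprocess.run(
            ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Usage (placeholder hostnames, replace with your own top-level relays):
# for relay in ["relay-nala.example.com", "relay-emea.example.com"]:
#     print(relay, "tcp:", tcp_reachable(relay), "icmp:", icmp_reachable(relay))
```

If TCP succeeds but ping fails, that points at ICMP being filtered somewhere between the two relays.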

If ICMP is not allowed and cannot be enabled, you should look at configuring _BESClient_RelaySelect_FailoverRelayList on the second-level relays. This gives you a list of parent relays that will be tried in order when none of the configured relays respond to ping. I believe the intent of the setting was to help clients initially find and register with a relay when the root server is not accessible from the client, but it helps in any case where ICMP is blocked.
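To make the ordering concrete, here is a toy model of the behavior as described above. It is not the client's actual selection code; the function name, the relay names, and the `responds` callback are all made up for illustration.

```python
def pick_relay(configured, failover, responds):
    """Pick a relay following the order described above.

    configured: manually configured relays (primary first, then secondary).
    failover:   relays from the failover list, in list order.
    responds:   callable taking a relay name, returning True if it answers.
    """
    for relay in configured:
        if responds(relay):
            return relay
    # None of the configured relays answered, so fall back to the list.
    for relay in failover:
        if responds(relay):
            return relay
    return None


# Hypothetical example: ping is blocked to both configured relays,
# but the second failover entry answers.
answering = {"failover-2"}
chosen = pick_relay(
    configured=["apac-top", "emea-top"],
    failover=["failover-1", "failover-2"],
    responds=lambda name: name in answering,
)
print(chosen)  # prints failover-2
```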

Also check the value of _BESClient_RelaySelect_ResistFailureIntervalSeconds on the child relays. It defaults to 10 minutes.

https://www.ibm.com/support/knowledgecenter/SSQL82_9.5.0/com.ibm.bigfix.doc/Platform/Config/r_client_set.html

I verified that the relays that did not switch are able to connect to the top-level relays in the other regions with both ICMP and TCP/52311. I’ll look into defining _BESClient_RelaySelect_FailoverRelayList on those second-level relays anyway.

The _BESClient_RelaySelect_ResistFailureIntervalSeconds setting does not appear to be configured on any of those relays.

Do you have IPv6 enabled by any chance?
I’ve seen similar behavior when IPv6 is enabled: the client won’t select the relays simply because the IPv6 implementation differs at the network level.

IPv6 was enabled on 2 of the 3 top-level relays (I’ve disabled it) but was not enabled on any of the second-level relays.

The clients control when the switch occurs, so it may depend on how often your client attempts to report, since a report attempt will trigger relay selection earlier than the long re-registration interval would.

How often do your clients report on this relay? The logs of the client should show what is happening.

I checked the logs on one of the second-level relays, looking for “report posted successfully”, and it looks like that occurs once an hour. Is this interval configurable? What’s a good recommended value for a relay?
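For anyone wanting to measure this rather than eyeball it, something like the following could pull the gaps between reports out of a client log. It assumes each entry starts with a line of the form `At HH:MM:SS ...` with the message on that line or the lines after it, and that the log covers a single day. The sample lines are made up to match that assumed layout and are not real log output.

```python
import re
from datetime import datetime

# Assumed layout: a timestamp line such as "At 09:00:05 -0500 - ".
TIMESTAMP = re.compile(r"^At (\d{2}:\d{2}:\d{2})")


def report_times(lines):
    """Return the time of day of each 'report posted successfully' entry."""
    times = []
    current = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match:
            current = datetime.strptime(match.group(1), "%H:%M:%S")
        if current is not None and "report posted successfully" in line.lower():
            times.append(current)
    return times


def report_gaps(times):
    """Return the gaps between consecutive reports within one day's log."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Made-up sample in the assumed layout.
sample = [
    "At 09:00:05 -0500 - ",
    "   Report posted successfully",
    "At 09:20:00 -0500 - ",
    "   (unrelated entry)",
    "At 10:00:07 -0500 - ",
    "   Report posted successfully",
]

gaps = report_gaps(report_times(sample))
print([str(gap) for gap in gaps])  # prints ['1:00:02']
```

To run it against a real log, pass the lines of the file in place of `sample`, for example `report_times(open(path, encoding="utf-8", errors="replace"))`, where `path` is wherever your client log lives.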

Tomorrow I’ll disable the BES Relay service on the APAC top-level relay and leave it for a few hours and will see what happens.

It’s the client that makes the reports, so it depends on your content stream. The minimum report interval is a setting, but the maximum is however long it takes to get through your content.