Relays not switching to secondary when primary top-level relay is offline

MB43 · January 11, 2019, 4:55pm

I have three top-level relays, one in each region (NALA, APAC, EMEA). Each has the Affiliation Advertisement list set to “TopLevel” so that none of the clients will connect to them.

The second-level relays are manually configured to use the top-level relay in their region as the primary, and one in another region as the secondary. None of the second-level relays have an Affiliation Seek list defined since they’re using manual relay selection.

Today the APAC top-level relay went offline but none of the second-level relays have switched over to the top-level relay that’s defined as their secondary relay. I thought that they would switch after 10 minutes but after nearly an hour I see them still trying to connect to the primary in the APAC region.

What am I missing in this configuration? Is there something else that needs to be configured on the second-level relays in order for them to switch over to a different top-level relay?

JasonWalker · January 11, 2019, 10:47pm

Can the second-level relays reach the remaining top-level relays on both ICMP and tcp/52311? It’s been my experience that even with manual relay selection, the child attempts to ping the relay before switching over to it.
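If you want to script the TCP half of that check from a second-level relay, here is a minimal sketch using only the Python standard library. The function name `tcp_reachable` and the hostname in the usage comment are hypothetical, made up for illustration. It only tests that a TCP connection to port 52311 can be opened; it does not test ICMP (raw ICMP sockets normally need elevated privileges, so the OS `ping` command is simpler for that part).

```python
import socket


def tcp_reachable(host, port=52311, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout.

    This checks plain TCP connectivity only; it says nothing about ICMP
    or about whether the relay service behind the port is healthy.
    """
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (hypothetical hostname, replace with your own relay):
#   print(tcp_reachable("toplevel-emea.example.com"))
```

A `False` result here points to a firewall, routing, or DNS problem between the child and the candidate parent, which would be worth ruling out before looking at the relay selection settings.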

If ICMP is not allowed and cannot be added, you should look at configuring _BESClient_RelaySelect_FailoverRelayList on the second-level relays. This gives you a list of parent relays that will be tried in order when none of the configured relays respond to ping. I believe the intent of the setting was to help clients initially find and register to a relay when the root server is not accessible from the client, but it helps in any case where ICMP is blocked.

Also check the value of _BESClient_RelaySelect_ResistFailureIntervalSeconds on the child relays. It defaults to 10 minutes.

https://www.ibm.com/support/knowledgecenter/SSQL82_9.5.0/com.ibm.bigfix.doc/Platform/Config/r_client_set.html

MB43 · January 12, 2019, 1:58am

I verified that the relays that did not switch are able to connect to the top-level relays in the other regions with both ICMP and TCP/52311. I’ll look into defining _BESClient_RelaySelect_FailoverRelayList on those second-level relays anyway.

The _BESClient_RelaySelect_ResistFailureIntervalSeconds setting does not appear to be configured on any of those relays.

fermt · January 14, 2019, 1:32pm

Do you have IPv6 enabled by any chance?
I’ve seen similar behavior when IPv6 is enabled: the client won’t select the relays just because the IPv6 implementation is different at the network level.

MB43 · January 14, 2019, 2:00pm

IPv6 was enabled on 2 of the 3 top-level relays (I’ve disabled it) but was not enabled on any of the second-level relays.

AlanM · January 14, 2019, 8:16pm

The clients control when the switch occurs, so it may depend on how often your client attempts to report, as a failed report will trigger relay selection earlier than the long re-registration interval would.

How often do your clients report on this relay? The logs of the client should show what is happening.

MB43 · January 14, 2019, 11:13pm

I checked the logs on one of the second-level relays, looking for “report posted successfully” and it looks like that occurs once an hour. Is this interval configurable? What’s a good recommended value for a relay?
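For anyone who wants to measure that interval rather than eyeball it, here is a rough sketch that pulls the gaps between successful reports out of a single day's client log. It assumes the usual client log layout, where a line starting with `At HH:MM:SS` is followed by indented message lines; the function names are hypothetical, and it does not handle gaps that span two daily log files.

```python
import re
from datetime import datetime

# Assumed layout: "At 09:00:00 -0500 - " followed by indented message lines.
AT_LINE = re.compile(r"^At (\d{2}:\d{2}:\d{2})")


def report_times(lines):
    """Return the timestamps of lines mentioning a successfully posted report."""
    times = []
    current = None  # Timestamp of the most recent "At ..." header seen.
    for line in lines:
        match = AT_LINE.match(line)
        if match:
            current = datetime.strptime(match.group(1), "%H:%M:%S")
        if current is not None and "report posted successfully" in line.lower():
            times.append(current)
    return times


def report_gaps_minutes(lines):
    """Return the gaps, in minutes, between consecutive successful reports."""
    times = report_times(lines)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


# Example usage (the log path is up to you):
#   with open(path_to_client_log, encoding="utf-8", errors="replace") as f:
#       print(report_gaps_minutes(f))
```

Consistent gaps of about 60 minutes would match what I saw by hand.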

Tomorrow I’ll disable the BES Relay service on the APAC top-level relay and leave it for a few hours and will see what happens.

AlanM · January 15, 2019, 12:25am

It’s the client that is making reports, so it depends on your content stream. The minimum report interval is a setting, but the maximum is however long it takes to get through your content.