How to force a relay selection when the current parent relay is not working properly

Overnight, one of our relays ran out of disk space and the clients using that relay were not able to receive any actions.
We were seeing the error "FAILED to Synchronize - Server returning old data, try again later." in the clients' logs. I assume this was because the relay could not fetch the latest mailbox version.
The clients continued using the problematic relay and the agents performed no actions, which forced our sysadmin to take several manual steps to get the servers patched last night.

We would hope that the clients would automatically switch to another relay when the current relay is not working properly (in this case the relay service was online the whole time but unable to process content) so that actions can run without issues, but that was not the case last night.

Would the setting _BESClient_RelaySelect_ResistFailureIntervalSeconds ensure that a new relay selection is run in a case like this?

As I mentioned, the relay never went offline; it was just refusing new content/registrations because of the disk space issue, so the connected clients still saw the relay as available.
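
For reference, if that setting does turn out to help, we would push it with a small settings action along these lines. This is only a minimal sketch; the 600-second value is purely illustrative, not a recommendation, so check the documented default and valid range for your client version first:

    // Sketch: push the relay-selection failure-resistance setting to clients.
    // 600 is an illustrative value only.
    setting "_BESClient_RelaySelect_ResistFailureIntervalSeconds"="600" on "{now}" for client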

We have encountered this issue on thousands of our relay and client devices; you need to put some checks in place and take the necessary corrective actions:

  • Determine which relays and clients are affected. It is preferable to build a retrieved property (RP) that searches the client logs for that error; if I were at my computer right now, I could share it (a rough illustrative sketch follows after this list).
  • If the relay is the cause of the issue, a relay reset (gather state) is required; child relays should be corrected before parent relays.
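
For illustration only (this is not the exact property I mentioned, just a rough sketch): relevance along these lines could flag clients whose logs contain that error. It assumes the default __BESData\__Global\Logs layout, and scanning every log line can be slow, so in practice you would likely limit it to recent log files:

    exists lines whose (it contains "Server returning old data") of files of folders "Logs" of folders "__Global" of data folder of client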

There are two options if the issue affects only BESClients.
Both of these actions must be carried out while the BES client service is stopped:
a. Delete the __BESData folder
OR
b. Delete the file __BESData\sitedata.db.
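
If you need to do option (b) at scale rather than by hand, one pattern is to have the action write a small batch file and launch it detached, so the batch can keep running after it stops the client service. A rough sketch only, assuming a Windows client and the sitedata.db location above; the file name and paths are placeholders, so adjust and test carefully before any broad use:

    // Sketch only: build a batch that stops the BES client, deletes
    // sitedata.db, and restarts the client, then launch it detached so it
    // survives the client service stopping.
    delete __createfile
    createfile until _END_OF_BATCH_
    net stop BESClient
    del /q "{pathname of data folder of client}\sitedata.db"
    net start BESClient
    _END_OF_BATCH_
    delete {pathname of windows folder}\Temp\__reset_sitedata.cmd
    copy __createfile {pathname of windows folder}\Temp\__reset_sitedata.cmd
    rundetached cmd.exe /c {pathname of windows folder}\Temp\__reset_sitedata.cmd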

We also voted for the idea below, which proposes that a self-heal procedure be added to the BESClient to deal with such issues automatically.

https://bigfix-ideas.hcltechsw.com/ideas/BFP-I-161

I have identified the root cause (the disk ran out of space) and fixed it. However, I would like the clients to automatically switch to a different relay when they are unable to receive content from, or post content to, their parent relay.

I don’t think there’s a good solution for this at the client level; “Server returning old data” is usually a retryable error, and I think the client shouldn’t make assumptions about whether the relay is going to recover on its own.

There’s probably a case to be made for the Relay to either self-heal or perhaps to shut itself down if it cannot recover (like when the disk is full). If it shuts down then the clients would fail over to the remaining relays after their ResistFailureInterval passes.


Yes, in some cases it will be fixed automatically, but this is the kind of error that requires monitoring in the infrastructure. We know the pain: because of it we lost approximately 5K devices, and several teams had to put in manual effort to get them online again.

We also raised a case with HCL product support, but there was no fix; the answer was just to delete the __BESData folder and do the manual work.

Maybe, as an interim workaround, an approach could be to have a fixlet that stops and/or disables the relay service if the relay disk falls below a certain threshold of free space. Possibly also set a client setting to “tag” the relay in some way, to record the fact that the service was stopped/disabled due to low disk; that brings visibility via Web Reports. At least with the local relay service stopped, the BES client on the relay would use a remote relay and should still be manageable with actions to address the issue so the relay service can be re-enabled.
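
A very rough sketch of that idea (Windows relay assumed; the drive letter, the 2 GB threshold, the _Custom_RelayStoppedLowDisk setting name, and using cmd.exe for the service commands are all placeholder choices to adapt and test):

    Relevance (applicability):
    exists service "BESRelay" and free space of drive "C:" < 2 * 1024 * 1024 * 1024

    Action script:
    // Sketch only: record why the relay was stopped, then stop and disable
    // the relay service so the local BES client fails over to a remote relay.
    setting "_Custom_RelayStoppedLowDisk"="{now as string}" on "{now}" for client
    waithidden cmd.exe /c net stop BESRelay
    waithidden cmd.exe /c sc config BESRelay start= disabled

The tag setting could then be surfaced through an analysis so the stopped relays show up in Web Reports.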


That was an approach I was considering, but I wanted to research whether any built-in capabilities could keep the clients from getting stuck retrying forever.

We do have some external monitoring, but by coincidence this was the only relay that was not being monitored, so nobody was aware of the disk space issue until the downloads got stuck.