BESRelay log and Opsite errors

pcpeteusa · August 21, 2020, 3:50pm

Hello, I'm trying to understand why the BigFix 9.5.15 Windows server is generating 20 MB of opsite errors a minute. The error message is “unable to find site ID for URL: http://mybigfixserver:52311/cgi-bin/bfgather.exe/opsite” followed by a number.

To determine the opsite numbers I used some session relevance: (name of it, URL of it) of operator site of bes users. Some of the operator site numbers were for accounts that no longer existed in the console, but many of the accounts did exist and some were connected and in use.

The log file keeps growing and growing and gets to 51 MB a number of times in a given day. Any idea what could be going on? SQL is installed locally.
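To see which operator sites account for most of the log volume, the errors could be tallied per opsite number. This is a minimal sketch, assuming each log line contains the message text quoted above with the number directly after "opsite"; the real log lines may carry timestamps or other fields, and the function name and log path are hypothetical.

```python
import re
from collections import Counter

# Assumed message shape, taken from the error text quoted in this post.
# Adjust the pattern if the actual log lines differ.
OPSITE_ERROR = re.compile(
    r"unable to find site ID for URL: \S*/opsite(\d+)", re.IGNORECASE
)


def count_opsite_errors(lines):
    """Count 'unable to find site ID' errors per opsite number."""
    counts = Counter()
    for line in lines:
        match = OPSITE_ERROR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical usage; replace the path with the real relay log location:
# with open("logfile.txt", errors="replace") as fh:
#     for site, hits in count_opsite_errors(fh).most_common(10):
#         print(site, hits)
```

The resulting numbers can then be compared against the operator site list from the session relevance query to separate deleted operators from active ones.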

JasonWalker · August 23, 2020, 1:47pm

It’s unusual to generate that much log data, but cleaning up obsolete sites on a relay is a pretty common issue we refer to as a Gather Data Reset. Unfortunately it’s not entirely trivial at the moment but I’m working on some content I hope will improve it.

Two conditions can actually occur:

1. Clients are still subscribed to the obsolete sites. When the operators were removed, an Unsubscribe action was automatically generated to remove the opsites on clients. However, if something like a server restore or snapshot rollback occurs, the server may have removed those unsubscription actions, and some clients may still be requesting the sites.
2. When a client requests the site, each Relay in the chain up to the root tracks that some client wants the site. From then on (forever), the Relay will try to connect to the given site at Relay start time to see whether there are any new versions of the site. The “Unsubscribe” action that was sent to Clients does not apply to the Relay caching mechanism. The only way to remove those references and stop the Relay from trying to cache the site is to perform the Gather Data Reset procedure on the Relays.
3. Because each Relay caches these client requests, a Relay Gather Data Reset needs to start on the bottom-level Relay tier and then work its way up the Relay chain. If you started from the top tier and worked your way down, these obsolete sites would just get re-cached on the top level as soon as a child Relay requested them.
4. If you are running the Relay HealthCheck, especially if you are running it frequently, that might cause more of these log messages than usual. I’m pretty sure we don’t recommend the Relay HealthCheck anymore; or if we do, only very infrequently, like “only at Relay startup” or maybe “once a week”. I’ve seen cases where the Relay HealthCheck is configured for hourly or daily, and this can generate a lot of workload on the Relay, increase the error logging, and even prevent normal gathers from completing.

The procedure for performing a Gather Reset for Relays is at https://support.hcltechsw.com/csm?id=kb_article&sysparm_article=KB0023994 and for the Root Server at https://support.hcltechsw.com/csm?id=kb_article&sysparm_article=KB0079078
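The bottom-up ordering for the reset can be worked out from the relay hierarchy. This is a minimal sketch, assuming you have already exported a mapping of each relay to its parent; the `parent_of` mapping, the relay names, and the `reset_order` function are all hypothetical and not part of any BigFix tool.

```python
def reset_order(parent_of):
    """Return relays deepest-first so child relays are reset before parents.

    parent_of maps each relay name to its parent's name, or to None for the
    top of the chain (the root server in this sketch).
    """
    def depth(relay):
        steps = 0
        seen = {relay}
        # Walk up the chain until there is no known parent.
        while parent_of.get(relay) is not None:
            relay = parent_of[relay]
            if relay in seen:
                raise ValueError("cycle in relay hierarchy")
            seen.add(relay)
            steps += 1
        return steps

    # Deepest tier first; names break ties so the order is stable.
    return sorted(parent_of, key=lambda r: (-depth(r), r))


# Hypothetical hierarchy for illustration only.
hierarchy = {
    "root": None,
    "relay-top": "root",
    "relay-mid": "relay-top",
    "relay-leaf-a": "relay-mid",
    "relay-leaf-b": "relay-mid",
}
print(reset_order(hierarchy))
```

Relays in the same tier can be reset in any order; only the tier-by-tier direction matters, since a parent reset before its children would re-cache the obsolete sites on the next child request.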

pcpeteusa · August 24, 2020, 1:09pm

Thank you for this information. In my opinion it seems the tail is wagging the dog. A client should not be able to cause all of these issues. If a client submits a request for a site that no longer exists, it should be told to unsubscribe/remove the site. Relays should not cache these requests indefinitely. Maybe these requests should time out after 8 to 12 hours.

ttheierl · October 1, 2020, 6:11pm

@jasonwalker, on point #4, when did you stop recommending the use of the _BESRelay_HealthCheck settings? I have mine set to _Enabled and _EnabledonStartup, but since reading this post I’ve noticed, with verbose logging enabled, that I have log entries for health checks occurring about once an hour without an interval being set. Is that the default? If so, I would like to change it to once a week or monthly, or disable it completely if that is the general recommendation.

What are the parameters for “_BESRelay_HealthCheck_Interval”? The description for it was available in the IBM Support documentation, but it is not in the HCL Support settings docs.

Thanks!

JasonWalker · October 1, 2020, 6:15pm

I’d have to defer to @AlanM on that. I know the health check is considered deprecated, but I don’t know what, if anything, it’s still intended to fix in current BES versions.

I think the _BESRelay_HealthCheck_Interval was in minutes but I’m not completely certain on that point.