How to find an Operator's opsite? (OR, how to fix what I've done?)

straffin · February 20, 2023, 9:59pm

Long story short, in an effort to clean up unused and abandoned items in our configuration, I deleted a few operator sites a while ago. Looking for somewhere to possibly grab information about computers checking in to the server, I took a look at our server’s BESRelay.log file and was greeted by a continuous stream of “Unable to find site id for URL” errors, as many as 5 per second, all for one of two “/opsite###” sites. I cannot find a way to identify who these (long deleted) sites belong to (the /operators API endpoint makes no mention of this). I’m hoping that deleting the Operators might delete the association between their computers and the (long deleted) operator sites.

OR, is there another way to address these errors?

straffin · February 20, 2023, 10:07pm

And, of course, once I hit Submit, I find them listed in the URLs under Operator Sites. However, more questions arise…

One of them is not listed, so I still can’t tell what Operator it belonged to.
The other of them is MINE. I’m still here, and my site is listed. There’s naught in it but an empty Manual group, but it’s there.

What now?

JasonWalker · February 20, 2023, 10:17pm

How did you delete the sites? Via the console, or by editing GatherState and bfemapfile.xml?
Or do you mean you deleted the operators associated with the sites?

Also - is your operator a Master Operator account? If it was previously a non-master operator, it had an opsite, but after promoting to Master Operator your account stops using the opsite and uses actionsite instead, while clients continue to gather your old opsite.

straffin · February 21, 2023, 12:48am

It was a while ago, but I believe I deleted them in the console by selecting the site and removing it that way. I definitely did NOT edit GatherState and bfemapfile.xml. I may have also deleted the operator associated with one of the error states…the other error state is me.

My account is an MO and (as far as I can recall) it has always been an MO account. It may have momentarily been non-MO as it was created as an LDAP account, then made MO, but that’s been well over a decade ago. The opsite for my account exists, but it is showing in the log as an error state:
Mon, 20 Feb 2023 16:03:55 -0500 - /cgi-bin/bfenterprise/besgathermirror.exe/-siteversion (12345) - Unable to find site id for URL: http://our.server.name:52311/cgi-bin/bfgather.exe/opsite241
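
To see which sites account for the error stream and how often each one appears, the log can be tallied per site. The following is a minimal Python sketch, assuming every error line follows the format of the single line quoted above; the regular expression is modelled on that one line only, and the `opsite17` entry and the final line in the sample are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this thread; other
# BigFix versions may word the message differently (assumption).
_MISSING_SITE = re.compile(r"Unable to find site id for URL:\s*(\S+)")


def count_missing_sites(lines):
    """Count 'Unable to find site id' errors per site name.

    The site name is taken as the last path segment of the logged URL.
    """
    counts = Counter()
    for line in lines:
        match = _MISSING_SITE.search(line)
        if match:
            url = match.group(1)
            counts[url.rstrip("/").rsplit("/", 1)[-1]] += 1
    return counts


prefix = ("Mon, 20 Feb 2023 16:03:55 -0500 - /cgi-bin/bfenterprise/"
          "besgathermirror.exe/-siteversion (12345) - ")
base = "http://our.server.name:52311/cgi-bin/bfgather.exe/"
sample = [
    prefix + "Unable to find site id for URL: " + base + "opsite241",
    prefix + "Unable to find site id for URL: " + base + "opsite241",
    prefix + "Unable to find site id for URL: " + base + "opsite17",  # hypothetical site
    prefix + "Some unrelated message",                                 # hypothetical line
]

counts = count_missing_sites(sample)
for site, n in counts.most_common():
    print(site, n)
```

For a real log, pass an open file instead of `sample`, for example `count_missing_sites(open(path, errors="replace"))`, where `path` points to your BESRelay.log.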

So, one error state makes sense and should be cleaned up. How?
The other error state shouldn’t be happening… the site is there. Now what?

anademayo · February 21, 2023, 6:52am

There is a way to clean up a single problematic site as documented here: How to do a gather state reset on a single site

As for your site that is having errors, have you tried creating a blank task under it to see if it propagates?

trn · February 21, 2023, 5:25pm

You can list the current operators and their sites using session relevance

(
	name of it , masthead operator name of it , 
	(
		if
			master flag of it 
		then
			"MO" 
		else
			url of operator site of it 
	)
)
of bes users

You can then look to see if those sites belong to a current operator. If they don’t then doing the gather state reset suggested by @anademayo may help (run from the leaf relay upwards) but a client that believes it needs the site will continue to attempt the downloads.
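
The comparison described above can be sketched in a few lines of Python. This assumes the operator site URLs returned by the relevance end in the same `opsite###` segment that appears in the log errors; the function names and the `opsite17` value are hypothetical.

```python
def site_name(url):
    """Return the last path segment of a gather URL, e.g. 'opsite241'."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def orphaned_sites(logged_sites, operator_site_urls):
    """Return logged site names that no current operator site URL matches."""
    known = {site_name(url) for url in operator_site_urls}
    return sorted(set(logged_sites) - known)


# Site names seen in the relay log ('opsite17' is a made-up example).
logged = ["opsite241", "opsite17", "opsite241"]

# URLs as the session relevance might return them (assumed format).
operator_urls = [
    "http://our.server.name:52311/cgi-bin/bfgather.exe/opsite241",
]

orphans = orphaned_sites(logged, operator_urls)
print(orphans)  # ['opsite17']
```

A site that shows up in `orphans` has no current operator and is a candidate for the gather state reset. A site that does not show up, yet still logs errors, is the odd case raised earlier in this thread.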

straffin · February 21, 2023, 7:50pm

What do you use to deploy session relevance? I’m currently using the REST API for that but I have a feeling there are other (maybe easier) tools available. I can’t get the old IBM “BES Session Relevance Tester” to work.
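
For reference, the REST API route mentioned here can be sketched as follows. This only builds the request URL. It assumes the root server exposes session relevance at `/api/query` with a `relevance` parameter on port 52311, which should be checked against the REST API documentation for your version. The server name is the placeholder used earlier in the thread, and sending the request still needs operator credentials.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# The relevance posted earlier in this thread, collapsed to one line.
RELEVANCE = (
    '(name of it, masthead operator name of it, '
    '(if master flag of it then "MO" else url of operator site of it)) '
    'of bes users'
)


def query_url(server, relevance, port=52311):
    """Build a query URL for session relevance (endpoint is an assumption)."""
    return f"https://{server}:{port}/api/query?" + urlencode({"relevance": relevance})


url = query_url("our.server.name", RELEVANCE)
print(url)

# Round-trip check: decoding the query string gives back the relevance text.
decoded = parse_qs(urlsplit(url).query)["relevance"][0]
print(decoded == RELEVANCE)
```

Letting `urlencode` handle the escaping avoids hand-encoding the quotes, commas, and parentheses in the relevance expression.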

JasonWalker · February 21, 2023, 8:42pm

For Session Relevance, I’d either use the Web Reports QNA page, or the Console Debugger.

For Web Reports, change the end of the URL to /webreports?page=QNA

For the Console, press CTRL-ALT-SHIFT-D to display the debug popup, then check the box at the top for “Show Debug Menu”. That adds the Debug option to the console, at the top (same level as the “File” menu). Under Debug->Presentation Debugger you can run session relevance

trn · February 22, 2023, 7:45am

Personally, I still use the Session Relevance Tester (https://support.bigfix.com/labs/relevanceeditor.html)

It is getting a bit long in the tooth now, and would benefit from being shown a bit of love, but I find it very handy.

If I code everything up to wrap the results in HTML tags, it will render the HTML. I then plug the code into a custom web report with some CSS to make it look pretty.

straffin · February 27, 2023, 11:07pm

I’d love to still use the SRT, but we’ve enabled SAML Authentication on our Web Reports server and I can’t connect that way with the SRT. It also throws an error stating “The provided URI scheme ‘https’ is invalid; expected ‘http’.”, so there’s that as well.

trn · February 28, 2023, 7:50am

Hmmm - we don’t (yet) have SAML on our WR server, but SRT seems happy enough with https

JasonWalker · February 28, 2023, 3:17pm

As for the original problem…Server/Relay logging error when trying to gather a site that no longer exists.

I’ve done quite a bit of research on this in some large customers with long-running deployments and have seen this pop up a number of times.

Something to note is that the log message is really the only impact. When it’s just a few sites noted, and the sites don’t exist, it’s safe to ignore the message. I’d actually wish we could suppress logging the message or at least make it a “softer” message because it’s not usually indicative of a real issue and usually no action needs to be taken on it.

But if it’s a huge number of sites, or the messages are causing grief / making it harder to identify real problems, there are some things we can do.

There are usually two factors to consider -

There may still be clients attempting to gather the sites. This is actually the rarer case, but we can have clients that never unsubscribed from a site that was removed, and keep asking their Relay to gather the site for them. The “documented” way to fix this is to remove the __BESData folder from the client and allow the client to reset to a clean state. That’s fine if it’s a small number of clients, you can identify those clients to begin with, and you’re ok with the client resetting (and potentially re-running older actions that are still open).
There may be no clients still trying to gather the sites, but Relays try to refresh the site versions at every startup, for any site that any client has ever tried to gather from the Relay. Deleted sites never get cleaned up from the Relay by default. The documented way to fix this is via a Gather State Reset on the Relay. This needs to be done in a specific sequence, from the lowest-level Relays first up to the top-level Relays, because a child relay requesting the site would put it right back into the parent Relay’s gather list.
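
The leaf-first sequence for the Relay resets can be worked out from the relay hierarchy. Below is a minimal Python sketch, assuming you already have a mapping of each relay to its parent; the relay names are hypothetical, and how you export that mapping from your deployment is not covered here.

```python
def reset_order(parent_of):
    """Return relays ordered deepest-first, so children come before parents.

    parent_of maps each relay name to its parent's name (None for the root).
    """
    def depth(relay):
        steps = 0
        seen = set()
        while parent_of.get(relay) is not None:
            if relay in seen:
                raise ValueError(f"cycle in relay hierarchy at {relay!r}")
            seen.add(relay)
            relay = parent_of[relay]
            steps += 1
        return steps

    # Deepest relays first; ties are broken alphabetically for a stable order.
    return sorted(parent_of, key=lambda relay: (-depth(relay), relay))


# Hypothetical three-level hierarchy.
hierarchy = {
    "root": None,
    "top-relay": "root",
    "leaf-a": "top-relay",
    "leaf-b": "top-relay",
}

order = reset_order(hierarchy)
print(order)  # ['leaf-a', 'leaf-b', 'top-relay', 'root']
```

Resetting in this order means no child relay is left to request the deleted site from a parent that has already been cleaned.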

There are some undocumented settings that you could work with Support on using to try to clean up the site gathers, short of performing full gather resets, but because there are some risks involved they’re not documented publicly.

JasonWalker · February 28, 2023, 3:34pm

One more thing to add - the question comes up “how does this happen”?

I’m not sure of all the potential root causes, but I’ve seen lots of reports about it, obviously enough reports that we have the whole Gather Reset process documented. The ways I’ve been able to reproduce it on-demand involve rolling a root server, relay, or client back to an earlier Snapshot or restoring a backed-up database, i.e., back up the database, create a site, subscribe clients to it, then restore back to the database version in which the site did not exist. Now the clients have a reference to sites that the server doesn’t know about, and the server/relays will continue to produce gather errors when clients try to gather the site.

straffin · February 28, 2023, 3:40pm

So, potentially every user that thinks setting up a macOS machine from a Time Machine backup is a good idea.

I hate that feature.