False Root and Gather Reset broke custom content

This has not been fun!

We had an issue last month and while working with support, they had us do a gather reset on the root server. I was a little leery of doing this but it did resolve the issue.

However, we have a false root. In our multi-tenant environment, we have very little control in customer environments, and it is sometimes difficult to get endpoints to point to a local relay due to routing; instead they point back to our BigFix URL, the one in our masthead. Because this caused thousands of machines to connect to our root server, we set up our top-level relays as decrypting relays and then changed the IP of our BigFix URL to point to one of our top-level relays. That relay has a hosts file entry pointing that URL back to our root server.

This removed the load on the root server and made things better and this has been in place for more than a year now. Everything was working fine, until the Gather Reset.

Now, we are getting custom content stuck in “Pending Download” with a 403 forbidden error. I thought doing a gather reset on the false root relay (using the fixlet) would resolve the issue, but it did not.

AND, this has not broken all custom content, just some of them.

Changing the URL to the IP of the root server resolved the 403 error.

So here we are. Anyone ever experienced this?


That…is pretty neat. I haven’t considered this scenario but I have some thoughts.

The Downloads should not necessarily be related to the Gather state. From a Downloads perspective, the Root / Relay is just acting like any old web server that happens to be running on port 52311 instead of the usual port 443. What could happen though is that the downloads didn’t need to send a request upstream, if they were already cached in the bfmirror/downloads/sha1 at some level of Relay, and those cached downloads were removed with your gather reset.

A 403 error is what we’d expect from a browser trying to download files from an Authenticating Relay. No client authentication certificate means Forbidden, and the client doesn’t try to send a certificate for a normal Download, just for gathering or sending download requests to a relay.

Which leads me to ask, what kind of Direct Download settings do you have on the clients and/or their local Relays? Because the 403 makes me think they aren’t sending a “download request upstream”, it looks like something in the path is trying to do a direct download (like a curl download would behave).

See List of settings and detailed descriptions

On clients I’d be looking for _BESClient_Download_Direct, _BESClient_Download_DirectRecovery, and _BESClient_Download_Direct_SubnetList, as well as the topic on Managing Downloads.

On Relays in the path between the client and the root, check the values for _BESGather_Download_CheckInternetFlag and _BESGather_Download_CheckParentFlag

If any of the Download_Direct settings are applied on the clients, I’d want to ensure that they also have _BESClient_Download_DirectRecovery configured, so that if the Internet download fails they’ll then try to send the request to an upstream relay (and honestly, I haven’t tried this scenario, so I can’t be positive that a 403 response correctly triggers the fallback recovery).
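If you export the client settings from your analysis, a quick check like the one below can list clients that have direct download turned on without the recovery setting. This is only a sketch: the setting names are the ones mentioned above, but the client names, the dictionary layout, and the assumption that "1" means enabled are mine.

```python
# Setting names are taken from the thread. Treating "1" as "enabled" and
# anything else (including a missing setting) as "disabled" is an assumption.
DIRECT = "_BESClient_Download_Direct"
RECOVERY = "_BESClient_Download_DirectRecovery"


def direct_without_recovery(settings_by_client):
    """Return the names of clients with direct download on and recovery off."""
    flagged = []
    for name, settings in settings_by_client.items():
        direct_on = str(settings.get(DIRECT, "0")).strip() == "1"
        recovery_on = str(settings.get(RECOVERY, "0")).strip() == "1"
        if direct_on and not recovery_on:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical export: client name -> {setting name: value}
sample = {
    "client-a": {DIRECT: "1"},
    "client-b": {DIRECT: "1", RECOVERY: "1"},
    "client-c": {},
}
print(direct_without_recovery(sample))
```

Anything this returns is a client that might try the URL directly and never hand the request to its relay.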

On the Relays, I’d also want to ensure that the ‘CheckParentFlag’ is set to 1 on every relay in the path. Otherwise the download request might not ever make it as far as the top-level relay (that resolves the real name of the root server).

This may have been broken for quite some time, and you might not have noticed it if the ‘CustomScripts’ files don’t change often and had already been cached on the lower-level relays. The setup you have now sounds like a traditional fake-root, what we would have done before the ‘Last Fallback Relay’ option was supported in the masthead. Now it might be simpler to use the ‘Last Fallback Relay’ masthead option instead, configuring some name that resolves to the local relay at each customer; that way there are no DNS games involved in faking the root server name, and all the Relays, Consoles, Web Reports, WebUI, etc. that need to reach the root directly, can, without HOSTS file entries.

I am doing a little research on your suggestions and will get back to you. However, I think we only used _BESClient_Download_Direct for one customer, two years ago. I am creating an analysis to gather that and the other settings.

As for the Last Fallback Relay, we tried that. It failed miserably and actually caused a ton of our fully remote systems to lose access to our server. See, the IP we used in the Last Fallback Relay setting was an internal relay. Remote systems did not have access to it, and when they could not reach other relays, they fell back to the internal one to connect. With no access, they went offline. Not all of our customers allow internet access to their systems, so using the external IP was not an option. We had to work with all of our customers to have their remote users connect to VPN to get those systems back online. And it’s not like you can just test it on some systems; that setting becomes part of the masthead.

With the URL in the masthead, I can have that resolve to one IP when it is a remote system and a different IP internally. Hence why we went to the false root. The only issue we had with that is having to train our NMOs to use the IP of the true root when opening the console, instead of the URL.
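For anyone trying to picture the split resolution, here is a rough way to check from a given machine which address the masthead name resolves to. It is a sketch under assumptions: bigfix.example.com and the 192.0.2.x / 198.51.100.x addresses are placeholders, not our real values. socket.gethostbyname_ex goes through the operating system resolver, so a hosts file entry on the machine is normally reflected in the answer.

```python
import socket


def resolved_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the sorted IPv4 addresses the local resolver gives for hostname."""
    _name, _aliases, addresses = resolver(hostname)
    return sorted(addresses)


def points_at(hostname, expected_ip, resolver=socket.gethostbyname_ex):
    """True if hostname resolves to expected_ip from this machine."""
    return expected_ip in resolved_ips(hostname, resolver)


# Stand-in resolver so the example runs without touching real DNS.
# It mimics what a machine behind the false root would see (placeholder values).
def fake_resolver(hostname):
    return (hostname, [], ["198.51.100.20"])


TRUE_ROOT_IP = "192.0.2.10"       # placeholder for the real root server
FALSE_ROOT_IP = "198.51.100.20"   # placeholder for the top-level relay

print(points_at("bigfix.example.com", TRUE_ROOT_IP, fake_resolver))
print(points_at("bigfix.example.com", FALSE_ROOT_IP, fake_resolver))
```

On a real machine you would drop the fake_resolver argument and let the default resolver answer.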

As I have said before, Multi-tenancy is a Beach!

Forgot to mention, changing the URL to the IP in the custom content does resolve the issue.

So with the URL, the system thinks the relay IP is the root, but when we change the URL to the IP in the content, that points to the true root.
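The content change itself is just a host swap in the download URL. A minimal sketch of that rewrite, assuming a hypothetical hostname, IP, and file path (none of these are our real values):

```python
from urllib.parse import urlsplit, urlunsplit


def with_host(url, new_host):
    """Return url with its hostname replaced by new_host, keeping any port.

    Any user:password@ part of the original URL is dropped.
    """
    parts = urlsplit(url)
    port = f":{parts.port}" if parts.port else ""
    return urlunsplit(
        (parts.scheme, f"{new_host}{port}", parts.path, parts.query, parts.fragment)
    )


# Hypothetical download URL; port 52311 is the one mentioned earlier in the thread.
original = "http://bigfix.example.com:52311/CustomScripts/example.ps1"
print(with_host(original, "192.0.2.10"))
```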

With more than 16,000 systems calling in, these 6 systems are the only ones with a setting and they are actually set to the default.

We use the EMSEID to identify the customers, so these settings are spread across 4 different customers.

I was just in the middle of writing a much longer response that involves sending the action just to the Root itself, and then to the Fake-Root, and continue to high-level Relays and work your way down until you find the failure, but I think this clue is probably crucial.
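The top-down approach I was describing boils down to testing each level in order and stopping at the first one that fails. A rough sketch of the bookkeeping, where the hop names and the probe are made up for illustration (in practice the "probe" is you sending the action to that machine and seeing whether the download completes):

```python
def first_failing_hop(hops, probe):
    """Walk hops from the root downward and return the first one whose probe fails.

    Returns None if every hop succeeds.
    """
    for hop in hops:
        if not probe(hop):
            return hop
    return None


# Hypothetical chain, ordered from the root down to a leaf relay.
chain = ["root", "fake-root", "customer-top-relay", "site-relay"]

# Hypothetical results recorded by hand after targeting each machine.
observed = {
    "root": True,
    "fake-root": True,
    "customer-top-relay": False,
    "site-relay": False,
}

print(first_failing_hop(chain, lambda hop: observed[hop]))
```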

You listed the client settings related to DirectDownload, but I don’t think you mentioned whether you checked for _BESGather_Download_CheckInternetFlag and _BESGather_Download_CheckParentFlag on the Relays themselves.

If a Relay has the CheckInternetFlag set to 1 and the CheckParentFlag set to 0, then that Relay will not send download requests to its parent - it will try to service the download request itself. Which would mean a direct connection to what it thinks is the URL for the download…which for everything except your fake-root itself, would point at the fake-root host.
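To make that rule easy to apply to an analysis export, here is a small sketch. It encodes only what is stated in this thread: CheckInternetFlag 1 with CheckParentFlag 0 means the relay fetches the URL itself, and CheckParentFlag 1 is what I want on every relay so the request goes upstream. Every other combination, including "Not Set", is reported as "unknown" rather than guessing at defaults. The relay names are hypothetical.

```python
def relay_download_path(check_internet, check_parent):
    """Classify how a relay handles a download request, per the rule above."""
    internet = str(check_internet).strip()
    parent = str(check_parent).strip()
    if internet == "1" and parent == "0":
        return "direct"    # relay tries the download URL itself
    if parent == "1":
        return "parent"    # relay passes the request to its parent
    return "unknown"       # not covered by the rule in this thread


def relays_going_direct(rows):
    """rows: iterable of (relay_name, check_internet, check_parent)."""
    return sorted(
        name
        for name, internet, parent in rows
        if relay_download_path(internet, parent) == "direct"
    )


# Hypothetical analysis results.
rows = [
    ("relay-01", "0", "1"),
    ("relay-02", "1", "0"),
    ("relay-03", "Not Set", "Not Set"),
]
print(relays_going_direct(rows))
```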

I presume your Fake-Root is configured for Relay Authentication? That would explain getting the 403 Forbidden response, rather than a 404 Not Found (since the Fake-Root also probably doesn’t have the ‘CustomScripts’ directory on it either).
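If you want to see which of those two responses a given host returns for one of the failing URLs, something like this sketch would do it from any machine with Python. The function names are mine, and the interpretations are just the reasoning from this thread, not an official mapping of BigFix behaviour.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a plain GET of url (no client certificate)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is on the exception.
        return err.code


def interpret_status(code):
    """Map a status code to the explanation discussed in this thread."""
    if code == 200:
        return "file served"
    if code == 403:
        return "forbidden: possibly an authenticating relay answering a direct request"
    if code == 404:
        return "not found: host answered but does not have the file"
    return "other"


# No network call here; http_status(url) would be run by hand against a failing URL.
print(interpret_status(403))
```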

Check for those two client settings on the Fake Root, and on the relays beneath the Fake Root, and then we can work it from there.

Some other thoughts -

  • Your Child Relays don’t need to depend on the Fake-Root configuration. You could have them manually-selected to the real name of your top-level/fake-root, and add the HOSTS entries so that for just those relays, the hostname of the root really does go to the root server’s IP. I’m presuming they can actually reach your root, since you said changing the download URLs to use its IP address does let the downloads work.

  • You could change the download URLs to not use your masthead name / fake-root name. In the past I’ve set up additional DNS aliases that also go to the root server, and used those aliases for the download URLs. The idea being that at some point I could move those downloads off to a separate, real Web Server, and as long as I redirected the alias for downloads I wouldn’t have to change the content. It’s also been handy in some DSA situations where I want the downloads to come from different DSA servers depending on where the client is coming from / which DSA server it’s reporting to.

  • You should be able to use the ‘Last Fallback Relay’ masthead option interchangeably with how you’re doing fake-root now. The ‘Last Fallback Relay’ does not need to be an IP address, in fact it’s much simpler if it’s a hostname; and then you can do the same DNS tricks you’re using now, to make that ‘Last Fallback Relay’ name resolve to one IP address for external clients and a different IP for internal clients (or to different IPs for each customer, directed to their local relay). The main advantages to doing that are you then don’t need to distribute HOSTS entries or train your operators to use alternate names or IP addresses, and it makes it a lot more clear from the Console which clients are using the fake-root/fallback name and which ones really are reporting to the root server.

I updated my analysis to include this check. This is what I am seeing for all relays, so far. We have over 400 relays so some have not checked in since I updated the analysis.

I used these to get the data.

if (exists relay service) Then ((value of setting "_BESGather_Download_CheckInternetFlag" of client as string) | "Not Set") else "Not a Relay"
if (exists relay service) Then ((value of setting "_BESGather_Download_CheckParentFlag" of client as string) | "Not Set") else "Not a Relay"

image

We only have 4 authenticating relays and they are all internet facing. Most of the systems getting the 403 error are connected to Internal Relays, not one of the authenticating relays. I checked this by looking for those with the _BESRelay_Comm_Authenticating set to 1.

Here is a ping from one of our lab systems, customers would get the same when pinging the URL in the masthead.

image

This is on our False Root…

I am at a loss as to what is happening. The workaround for now is updating the content with the IP instead of the hostname.

Changing the content is what we are doing now but I will discuss this with my team and see what they think.

Does the root server have a proxy configured, and not have an exception for itself? Check with

Besadmin /setproxy

It does have a proxy and, although I still need to verify, it does have exceptions for itself by hostname, URL, loopback, and IP address.

That is what caused a bunch of issues last month. We had a single external site that would not update. We later found that the proxy we were on had issues, but that was not apparent at first investigation. We eventually moved to another proxy during our attempts to diagnose.

A Gather Reset was part of the solution.

I have logged off for the day and have a late start tomorrow. I will verify the proxy is set up correctly, because that actually makes a lot of sense if it is not set up with the URL.

I also found out, the proxy used by BES Admin must be exactly the same as the proxy set at the system level. I found it odd but when it was not set with the same exceptions in the same order, it was not working correctly.

Also, the system proxy uses a semicolon between exceptions, but in the BESAdmin setting you must use a comma. It would be nice if that was not the case.
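Since I keep having to convert one list into the other by hand, here is a minimal sketch of the conversion. It only swaps the separator and trims spaces, and it keeps the entries in the same order, since the order mattered for us. The entries shown are placeholders.

```python
def to_besadmin_exceptions(system_list):
    """Convert a semicolon-separated exception list to a comma-separated one.

    Order is preserved and empty entries are dropped.
    """
    entries = [entry.strip() for entry in system_list.split(";")]
    return ",".join(entry for entry in entries if entry)


# Placeholder exception list in the system-level (semicolon) style.
system_exceptions = "localhost;127.0.0.1; bigfix.example.com;192.0.2.10"
print(to_besadmin_exceptions(system_exceptions))
```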

I made some adjustments to the proxy to see if that resolves the issues.

I was associating the gather reset with the issue because the issue began at the same time we did the gather reset. However, it just did not click with me that we also changed the proxy during the content issue.

It makes a lot more sense that the issue is related to the proxy settings and not the gather reset. Let me ride with this for a bit and see if it continues.
