Gatherer.bigfix.com outage

FYI - according to HCL

gatherer.bigfix.com Outage (hcltechsw.com)

  1. Most on-prem customers will be observing gather errors in their root server log. Content updates for existing subscribed sites are expected to work fine. In addition, any allocation changes made in FNO will not be effective until the gatherer service is restored.
  2. Air gapped customers will not be able to run the airgap tools for updates (both content and sites).
  3. New BigFix server installs (new S/N) will fail.

Thanks for sharing Boyd

Hi, is there any update on when gatherer.bigfix.com will return?

As of right now, the traffic is being redirected to BigFix.flexnetoperations.com, and this domain is blocked at the customer's site.

The customer is also air-gapped, so I can’t use the Airgap Tool to download new content.

Just wanted to point out that, as of right now, if you are using the BigFix Airgap Tool and you have not blocked BigFix.flexnetoperations.com, the tool is able to contact it and gather new content.

Do we have any updates on this?

It is only by accident that I came across the KB article opened by HCL to document the issue. This type of global outage ought to be communicated through other channels, such as bigmail and especially here in the forums.

I agree. If you hadn’t posted here, we would never have heard of it and would have spent hours troubleshooting.

HCL needs to communicate better when an outage with a big impact occurs.

Looks like a database crashed (error 500). These are the last two lines in my BESRelay.log file:

Wed, 12 Jun 2024 10:22:34 -0700 - LicenseUpdater (20412) -   HTTPS connection to {https://gatherer.bigfix.com/cgi-bin/LicenseServerFrontend.pl} was unsuccessful due to {Unexpected HTTP response: 500 Internal Server Error}
Wed, 12 Jun 2024 11:22:43 -0700 - LicenseUpdater (20412) -   HTTPS connection to {https://gatherer.bigfix.com/cgi-bin/LicenseServerFrontend.pl} was unsuccessful due to {HTTP Error 47: Number of redirects hit maximum amount: Maximum (20) redirects followed}
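
If you want to check whether your own relay or root server is logging the same failures, here is a minimal sketch that pulls the timestamp, URL, and reason out of lines shaped like the two above. The regular expression is inferred only from these two lines, and the names `parse_failure`, `classify`, and the category labels are made up for illustration; other log lines may look different and will simply not match.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Layout inferred from the two LicenseUpdater lines quoted above (assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4})"
    r" - (?P<component>\w+) \(\d+\) -\s+"
    r"HTTPS connection to \{(?P<url>[^}]+)\} "
    r"was unsuccessful due to \{(?P<reason>.+)\}\s*$"
)


class GatherFailure(NamedTuple):
    when: datetime
    component: str
    url: str
    reason: str


def parse_failure(line: str) -> Optional[GatherFailure]:
    """Return the parsed failure, or None if the line has another layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("ts"), "%a, %d %b %Y %H:%M:%S %z")
    return GatherFailure(when, m.group("component"), m.group("url"), m.group("reason"))


def classify(reason: str) -> str:
    """Rough, hypothetical buckets for the two symptoms seen in this thread."""
    if "redirects" in reason:
        return "redirect-loop"
    if re.search(r"\b5\d\d\b", reason):
        return "server-error"
    return "other"
```

Feeding each line of the log through `parse_failure` and counting `classify(f.reason)` per hour would show when the 500 errors turned into redirect loops on your side.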

That would be okay. What is not okay is that the service is not running on a fault-tolerant architecture. That is totally unacceptable for a service that provides content to thousands(?) of clients.

Agreed. This kind of outage should be avoidable.

From @JasonWalker over in the BigFix slack:

I can’t speak officially on it…it’s up now for some customers (including my lab deployments) but last update I see is ETA 12pm PST for everyone.

Reported issues have been affecting our license and gather back-end systems since yesterday. The BigFix team is working to address these issues with the highest priority.
At the current time, content gathers in HTTPS mode are reported to be working correctly (HTTP gathers are still being worked on).
More timeouts and retries than expected are being reported; this is not expected to affect system functionality.
Serial number allocation modification is also still experiencing issues.
We are continuing to drive these issues to resolution and will provide further updates by EOD today.

HTTPS mode is not fully working:

Thu, 13 Jun 2024 14:04:47 -0400 - LicenseUpdater (2476) - HTTPS gather for https://gatherer.bigfix.com/cgi-bin/LicenseServerFrontend.pl was unsuccessful. HTTP fallback is disabled

Was this caused by an expected change?

Hello,
the reported error is most likely related to the timeouts mentioned above, which are causing a fallback to HTTP.

The BigFix team continues to work with high focus on resolving the issues, testing changes as we go. We plan to have another update by the morning of 06/14.

Thanks for your understanding.

The change was expected; this outcome was not.

Are you still getting this now? I’m told that should be working now.

If it is still not working, try setting _BESGather_Use_Https to 2 on the root server to see if that addresses the issue. Also try the -usehttps command line parameter with the airgap tool.

I got it working at some point.

Thanks

I’m not sure if this is the correct setting, but you might also need to set _BESData_Comm_TimeoutSeconds to 45 or higher at the moment to help with timeouts. I’m not 100% sure that this is the setting that controls the gather timeout, but I know it definitely affects other timeouts.
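
For anyone who prefers to push the two settings mentioned in this thread through a custom action instead of editing them by hand, here is a rough sketch that prints the corresponding ActionScript `setting` lines. The helper name `setting_command` is hypothetical, the values (2 and 45) are just the ones suggested above, and whether `_BESData_Comm_TimeoutSeconds` actually governs the gather timeout is still unconfirmed, so treat this as a starting point and target only the root server.

```python
def setting_command(name: str, value: str) -> str:
    """Render one ActionScript line that sets a client setting.

    The 'action issue date' clause stamps the setting with the date the
    action was issued, which is the usual form for setting commands.
    """
    return (
        f'setting "{name}"="{value}" '
        f'on "{{parameter "action issue date" of action}}" for client'
    )


# Values are the ones suggested in this thread (assumptions, not vendor guidance).
SUGGESTED = {
    "_BESGather_Use_Https": "2",
    "_BESData_Comm_TimeoutSeconds": "45",
}

if __name__ == "__main__":
    for name, value in SUGGESTED.items():
        print(setting_command(name, value))
```

The printed lines can be pasted into a custom action's script tab. Remember to revert the timeout once the gather service is stable again if you only raised it as a workaround.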

An update on the current situation: we are still actively engaged in bringing all problems to resolution.
As of now, gathers in both HTTPS and HTTP mode are reported to be working correctly.
Timeouts are still occurring at a higher rate than expected, but this should not affect system operation.
The Serial Number allocation issue is still outstanding. We continue to work with the highest focus to address this as well.
We plan to give another update before end of day today.
Thank you

Issues with gathering, including timeouts, are reported to be significantly improved. Testing is still in progress. Serial Number allocation is still undergoing analysis.
I plan to have another update tomorrow (Saturday).
Thanks

Tests today demonstrate good improvement in gather efficiency; timeouts are reduced and should not affect normal operation. Still working on the Serial Number allocation issue.
Next update on Sunday.


Sunday:
Gathers are still behaving. No other significant update today. Next update on Monday. Thank you

Monday: According to the latest information, serial number allocation is now working properly. We are continuing the investigation and monitoring the situation closely.
I will not post other updates unless anything major happens. If you still encounter issues, please open a support case.
Thank you