We found the cause of the problem.
Rebooting the relays did not actually fix the problem. At the same time, someone was logged in to the core servers and was setting the FillDB Carbon Copy setting in preparation for upcoming performance testing. The directory path they specified for the setting had a typo in the drive letter, and that typo caused FillDB to become non-functional. When they removed the setting, things started working again.
Since this discovery, we have reproduced the problem in the lab environment several times. So this was the source of the Server Internal Error messages seen in the Top Level Relay logs mentioned in my first post.
The server does not need to be restarted for the problem to occur. As soon as the incorrect value is applied, FillDB stops processing reports.
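Since a bad path takes effect immediately, it may be worth sanity-checking the directory on the core server before applying the setting. Below is a minimal sketch of such a check. The function name `check_carbon_copy_dir` is hypothetical and not part of BigFix; it only tests that the path exists as a directory and appears writable, and it does not read or change any BigFix setting.

```python
import os


def check_carbon_copy_dir(path):
    """Return a list of problems found with a candidate directory path.

    An empty list means the path exists as a directory and appears
    writable to the current user. Hypothetical helper for illustration;
    it does not touch any BigFix configuration.
    """
    problems = []
    if not os.path.isdir(path):
        # A typo in the drive letter or folder name ends up here.
        problems.append("not an existing directory: " + path)
    elif not os.access(path, os.W_OK):
        problems.append("directory is not writable: " + path)
    return problems


if __name__ == "__main__":
    # Example path taken from the log in this post; substitute your own.
    for problem in check_carbon_copy_dir(r"W:\FillDBCC"):
        print(problem)
```

Run it under the same account the server service uses, since drive mappings and permissions can differ between accounts.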
If the server is restarted, it will fail to start up, and the log will end with this:
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - BES Root Server version 9.5.11.113 built for WINVER 6.0 x86_64 running on WINVER 10.0.14393 i386, starting
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - OpenSSL Initialized (FIPS Mode)
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Using OpenSSL crypto library libBEScryptoFIPS64 - OpenSSL 1.0.2p-fips 14 Aug 2018
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Successfully connected to database
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Signature Algorithms: sha256, sha1
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Download Algorithms: sha256, sha1
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - TLS Cipher List: HIGH:!ADH:!AECDH:!kDH:!kECDH:!PSK:!SRP
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Successfully read server signing key
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Successfully read client CA key
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Error detected during Run(): File error "class DirectoryNotFoundError" on "W:\FillDBCC" : "Windows Error 0x3%: The system cannot find the path specified."
The Windows DirectoryNotFoundError is the indicator here. Fix the setting value and the server will start again, and FillDB will function correctly.
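If you want to check for this condition automatically, here is a rough sketch that scans log text for the error line shown above and pulls out the error class and the offending path. The regular expression is based only on the single log line in this post, so treat the format as an assumption. The function name `find_file_errors` is hypothetical, and the caller has to supply the log text, since the log file location is not covered here.

```python
import re

# Pattern modeled on the one error line shown in this post (an assumption,
# not a documented log format).
FILE_ERROR = re.compile(r'File error "class (?P<cls>\w+)" on "(?P<path>[^"]+)"')


def find_file_errors(log_text):
    """Return (error_class, path) tuples for each matching error line."""
    hits = []
    for line in log_text.splitlines():
        if "Error detected during Run()" not in line:
            continue
        match = FILE_ERROR.search(line)
        if match:
            hits.append((match.group("cls"), match.group("path")))
    return hits
```

A hit whose class is `DirectoryNotFoundError` and whose path matches the Carbon Copy directory points at the misconfigured setting.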
This behavior is concerning because we do performance testing on production data using FillDB Carbon Copy. On production cores we point it at the FillDB BufferDir of the BigFix server installations that we want to test. Sometimes those installations are on new hardware; other times it is a new version of BigFix that we want to performance test. It is most helpful to record this performance test data for more than 24 hours, but now that we know a network outage could stop the production core server’s FillDB from processing, it seems too risky to leave FillDB Carbon Copy enabled without supervision. That’s unfortunate, because it means we will have to settle for less than 24 hours of performance testing data unless we want to risk an extended production outage after business hours.