Relays Down, Logs Say "Internal Server Error 500"

quest · January 14, 2020, 11:37pm

Today this customer rebooted most of their Top Level Relays to overcome a situation where they were unable to post results for over an hour. The BigFix Core Server had an empty FillDB BufferDir during this time. We were able to verify that the Relays and Core Server could connect to each other, but for some reason, none of the relays could post results. We are unable to explain what caused this, or why a reboot of the relays fixed it.

Around the time when the problem occurred, the BigFix Relay logs had many errors like this:

Tue, 14 Jan 2020 13:18:30 -0500 - 2917133120 - PostResultsForwarder ( HTTP Error 500 ): Server internal error.

What’s strange about this is that the error above seems to indicate a problem encountered by the BigFix Core Server when it attempts to process the relay’s POST request. One would expect the issue to be solved by troubleshooting the Core Server, but it appeared to be resolved by rebooting the relays instead. There did not appear to be anything related to the issue in either the Core Server’s BESRelay.log or FillDB.log. The only sign of the issue on the Core Server was an empty FillDB BufferDir for the duration of the issue.
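For anyone who wants to pin down when these errors started and stopped in a relay log, here is a minimal Python sketch. It assumes the log lines follow the same layout as the single line quoted above; the function name `count_post_errors` and the regular expression are illustrative and not part of BigFix.

```python
import re
from collections import Counter
from email.utils import parsedate_to_datetime

# Pattern modeled on the one relay log line quoted in this post (an assumption:
# other BigFix versions may format these lines differently).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}, \d{1,2} \w{3} \d{4} \d{2}:\d{2}:\d{2} [+-]\d{4}) - "
    r"\d+ - PostResultsForwarder \( HTTP Error (?P<code>\d{3}) \)"
)


def count_post_errors(lines):
    """Count PostResultsForwarder HTTP errors per (hour, status code)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a PostResultsForwarder error line
        timestamp = parsedate_to_datetime(match.group("ts"))
        hour_bucket = timestamp.strftime("%Y-%m-%d %H:00")
        counts[(hour_bucket, int(match.group("code")))] += 1
    return counts
```

Feeding it the lines of a relay log (opened from wherever the log lives on that relay) gives a per-hour count that can be compared against the time of any change made on the Core Server.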

What’s even stranger is this occurred on two separate BigFix installations here at roughly the same time.

We would really like to know what caused this so we can focus on preventing it from happening again in the future.

Our BigFix Core Server and Relays all run BigFix version 9.5.12.

FDA · January 16, 2020, 10:52am

The information you have collected is not enough for a complete problem determination. The server debug log would help in this case. I think the best thing to do is to contact support, who will likely ask for more information to investigate the problem further.

quest · January 16, 2020, 9:03pm

We found the cause of the problem.

The reboot of the relays actually didn’t fix the problem. It turns out that at the same time, someone was logged in to the core servers and was setting the FillDB Carbon Copy setting in preparation for upcoming performance testing. The directory path they specified for the setting had a typo for the drive letter. It turns out that this typo caused FillDB to become non-functional. When they removed the setting, things started working again.

Since this discovery, we have reproduced the problem in the lab environment several times. So this was the source of the Server Internal Error messages seen in the Top Level Relay logs mentioned in my first post.

The server does not need to be restarted in order for the problem to occur. As soon as the incorrect setting is set, FillDB will stop processing reports.

If the server is restarted, it will fail to start up, and the log will end with this:

Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - BES Root Server version 9.5.11.113 built for WINVER 6.0 x86_64 running on WINVER 10.0.14393 i386, starting
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - OpenSSL Initialized (FIPS Mode)
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Using OpenSSL crypto library libBEScryptoFIPS64 - OpenSSL 1.0.2p-fips  14 Aug 2018
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Successfully connected to database
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Signature Algorithms: sha256, sha1
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Download Algorithms: sha256, sha1
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - TLS Cipher List: HIGH:!ADH:!AECDH:!kDH:!kECDH:!PSK:!SRP
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Successfully read server signing key
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Successfully read client CA key
Thu, 16 Jan 2020 15:53:05 -0500 - Main Thread (10132) - Error detected during Run(): File error "class DirectoryNotFoundError" on "W:\FillDBCC" : "Windows Error 0x3%: The system cannot find the path specified."

The Windows DirectoryNotFoundError is the indicator here. Fix the setting value and the server will start again, and FillDB will function correctly.
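Since a single mistyped drive letter was enough to cause this, one cheap safeguard is to check the directory before applying it as the Carbon Copy value. The following is only a sketch: `check_carbon_copy_dir` is a hypothetical helper, not a BigFix tool, and it would need to run on the Core Server under the same account as the BigFix services, because drive letters and network paths can resolve differently for other users.

```python
import os


def check_carbon_copy_dir(path):
    """Return a list of problems found with a candidate directory path.

    An empty list means the path looked usable at the time of the check.
    It does not guarantee the path stays reachable afterwards.
    """
    problems = []
    if not os.path.isabs(path):
        problems.append("path is not absolute")
    if not os.path.isdir(path):
        problems.append("directory does not exist or is unreachable")
    return problems


if __name__ == "__main__":
    # The mistyped path from the log above; on a host without a W: drive
    # this reports at least one problem.
    for problem in check_carbon_copy_dir(r"W:\FillDBCC"):
        print("Problem:", problem)
```

This only catches mistakes at the moment the setting is entered. It would not help if the target became unreachable later.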

This behavior is concerning because we do performance testing on production data using FillDB Carbon Copy. On production cores we set it up to point to the FillDB BufferDir of BigFix server installations that we want to test. Sometimes the installations we want to test are on new hardware; other times it is a new version of BigFix that we want to performance test. It is most helpful to record this performance test data for over 24 hours, but now that we know a network outage could stop the production core server’s FillDB from processing, it seems too risky to leave FillDB Carbon Copy enabled without supervision. That’s unfortunate, because it means we will have to settle for less than 24 hours of performance testing data unless we want to risk an extended production outage happening after business hours.

FDA · January 17, 2020, 9:08am

@quest Please open a service ticket for the FillDB Carbon Copy problem. The service team will evaluate a fix. I am interested in the performance testing that you are going to set up. We have a performance team here at HCL that has run several tests of FillDB; most of the results are collected in the “capacity plan” document available here:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Tivoli%20Endpoint%20Manager/page/Performance%20&%20Capacity%20Planning

Let me know if you are looking for something else.