Intermittent heartbeat issues after upgrading to 9.5.11

Meydey · March 5, 2019, 9:27pm

I thought I would throw this out there while I am compiling data for a PMR.
We have been having intermittent issues with heartbeat and action response from a large number of random servers. We upgraded to 9.5.11 on Jan 30. Since then SQL has dropped about 8 times, and I have had to bounce the FillDB service about as many times. Parallelism is enabled. Enhanced Security is enabled.
Came in yesterday (Monday) and about 1/3 (5k) of all endpoints were grey in the console with a last check-in time of 3/2 (Friday). Did the usual (clear cache, refresh, etc.), no change. The weird thing, though, is that the BESClient logs on multiple endpoints showed no issue. According to the logs, endpoints were reporting, synching fine, and responding to actions, but none of it was showing up in the console. Case in point was my own PC, which was on with no issues all weekend.
This is what I saw at 10am 3/4.

Client log history was missing 3/2-3/2

This is the response after a Force Refresh:
At 10:32:57 -0800 -
ForceRefresh command received. Version difference, gathering action site.
At 10:33:35 -0800 -
Successful Synchronization with site ‘actionsite’ (version 1070603) - ***
At 10:33:46 -0800 -
Gathering all operator/mailbox sites.
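
For anyone who wants to check a client log for quiet stretches instead of eyeballing it, here is a minimal Python sketch. It assumes only the layout visible in the excerpt above: a line of the form "At HH:MM:SS -ZZZZ -" followed by one or more message lines. The function names parse_entries and quiet_periods are made up for this example, and it treats the text as a single day's log, so it does not handle entries that cross midnight.

```python
import re

# Timestamp lines as they appear in the excerpt, e.g. "At 10:32:57 -0800 -".
# Assumption: every entry in the log uses this exact layout.
STAMP = re.compile(r"^At (\d{2}):(\d{2}):(\d{2}) [+-]\d{4} -$")


def parse_entries(text):
    """Return (seconds_since_midnight, message) pairs from client log text."""
    entries = []
    seconds = None
    for raw in text.splitlines():
        line = raw.strip()
        match = STAMP.match(line)
        if match:
            hours, minutes, secs = (int(g) for g in match.groups())
            seconds = hours * 3600 + minutes * 60 + secs
        elif line and seconds is not None:
            # Any non-empty line after a timestamp is treated as a message.
            entries.append((seconds, line))
    return entries


def quiet_periods(entries, min_gap_seconds):
    """Return (gap_seconds, next_message) for gaps of at least min_gap_seconds."""
    gaps = []
    for (start, _), (end, message) in zip(entries, entries[1:]):
        if end - start >= min_gap_seconds:
            gaps.append((end - start, message))
    return gaps
```

Feeding it a whole day's log with a threshold of an hour or so would list the periods where the client logged nothing, which can then be compared against the last report time shown in the console.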

Ran BESClient Diagnostics and no errors came back. Bounced FillDB to change the log file size, and my PC finally checked in at 2pm. When it checked in, all of the test fixlets I had run over the previous few hours went from Not Reported to Completed. This was not an issue prior to the upgrade. Also, I was chatting with Mark Leaphart the whole time and he was stumped.

Anyone have anything I should look at?

Edit - Just checked and this ep has not checked in in 2.5 hours today.

jpluff · March 8, 2019, 2:02pm

I don’t have a solution, just reporting a related issue regarding FillDB. We are still on 9.2.8.74 and we have a scheduled task to bounce FillDB daily as a bandaid fix to force the ILMT VM manager data to refresh. We are planning to upgrade to 9.5 as soon as we get our new hardware in, and hoping to land on a version that is more stable. Appears 9.5.11 won’t be the one we pick.

rival178 · March 8, 2019, 2:30pm

I am on 9.5.10 and have the same issue. I’ve noticed this on earlier versions also. I’ve not found a solution, and PMRs never found any issues. Upgrading to newer versions of BigFix has NOT helped.

Meydey · March 8, 2019, 4:03pm

Mine could be issues with QOS blocking the inbound TCP. We share a scavenge queue with backups and other data, and there has been an elevated backup schedule for our PCI servers at remote locations hogging the connections. Been trying to convince our Networking team that BF deserves its own queue.
Also, it has been suggested in other posts to set up a job to restart FillDB on a daily basis. May look into that.

steve · March 9, 2019, 6:13am

When data isn’t updated in the console it’s definitely due to either reports not getting to the server, or problems with FillDB processing the reports. If you don’t see reports in your FillDB bufferdir, then the QOS blocking could be the problem (or something else preventing the reports from making it to the server).

If you do see reporting in your bufferdir, then there is an issue with FillDB. Turning on FillDB performance logging is a good way to see if and how fast it is processing things. Restarting FillDB daily shouldn’t be necessary, and might mask a more serious problem.
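
A rough way to tell the two cases apart is to measure the backlog in the bufferdir over time. The Python sketch below counts the files under a directory and reports the age of the oldest one. The function name buffer_backlog is made up for this example, and the bufferdir location is not given in this thread, so the path has to be supplied for your own server.

```python
import os
import time


def buffer_backlog(bufferdir, now=None):
    """Return (file_count, oldest_age_seconds) for files under bufferdir.

    bufferdir is whatever path your FillDB buffer directory lives at;
    it is passed in because this thread does not state the location.
    """
    now = time.time() if now is None else now
    count = 0
    oldest = None
    for root, _dirs, files in os.walk(bufferdir):
        for name in files:
            try:
                mtime = os.path.getmtime(os.path.join(root, name))
            except OSError:
                # The file may be consumed between listing and stat; skip it.
                continue
            count += 1
            oldest = mtime if oldest is None else min(oldest, mtime)
    age = 0.0 if oldest is None else max(0.0, now - oldest)
    return count, age
```

Run it a few minutes apart. A count and oldest age that keep climbing suggest FillDB is not keeping up, while an empty directory alongside a stale console points at reports not reaching the server.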

cdh · April 1, 2019, 7:26pm

Had a similar problem that involved FillDB.

The issue was traced to BigFix running its nightly DB maintenance job at about midnight AND our BES Computer Removal task via BES Admin Tool being configured to run exactly at midnight.

No issues before 9.5.11 but shortly after the upgrade, the console would show all grey, the FillDB bufferdir was loaded, but FillDB was just not inserting those reports. Restarting the FillDB service remedied that and everything went back to normal.

Our DBAs, who are in charge of maintaining BFEnterprise, said the system resources would hit 100% until FillDB was restarted. We tracked the issue down in the logs to about when the maintenance jobs ran, so we changed the time of the BES Computer Removal job and we haven’t had the issue since (going on a week or so).

Up until this upgrade, we never had this issue. The logs did show DB connection issues around the times the jobs ran even back then, so FillDB was seemingly able to recover on its own. Why that stopped in 9.5.11 is something I would like to know, if I had more time to dig.
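
If you keep a list of what is scheduled on the server, a quick overlap check can flag this kind of collision before it bites. This is only a sketch: the function names and the 60-minute window are arbitrary choices for illustration, and you would need to look up the actual start times for your own environment.

```python
from datetime import time


def minutes_apart(a, b):
    """Shortest distance in minutes between two times of day, wrapping at midnight."""
    minutes_a = a.hour * 60 + a.minute
    minutes_b = b.hour * 60 + b.minute
    diff = abs(minutes_a - minutes_b)
    return min(diff, 24 * 60 - diff)


def jobs_collide(a, b, window_minutes=60):
    """True if two job start times fall within window_minutes of each other.

    The 60-minute default is an assumption, not a documented job duration.
    """
    return minutes_apart(a, b) <= window_minutes
```

For example, jobs_collide(time(0, 0), time(0, 0)) returns True for the midnight/midnight case described above, while moving one job to time(3, 0) makes it return False.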

Meydey · April 1, 2019, 9:11pm

Interesting. My issue occurs weekly and it happened again Friday night at 12 midnight. Went and checked and we run Computer Remover every 7 days…at midnight.
Is there a log that tracks BES computer remover process start?

cdh · April 1, 2019, 10:20pm

Yup there is a log - BESTools.log in your BES Server folder.