FillDB is slow after 9.5.6.63 upgrade

Hi All,

I hope someone can help us.

After we upgraded to version 9.5.6.63, FillDB became very slow and we are facing more than 10,000 files in BufferDir.
So we started reviewing all of our infrastructure, and we are not able to find out what is happening…

  1. Our environment is distributed across 160,000 clients, 40 top-level relays, and 6,000 relays, where the root server is Windows Server 2008 R2 virtualized in VMware with 20 CPUs and 128 GB of RAM;
  2. The Windows server is using less than 30% of CPU and about 11% of memory;
  3. We have set ParallelismEnabled and DatabaseBoostLevel, and set both NumberOfDBUpdatingThreads and NumberOfParsingThreads to 5;
  4. In addition, we have verified the SQL Server (it is remote), which is normal and without issues, and the network as well;
  5. We have also checked the VMware health, which is good, including disk I/O (our VMware administrator told us the VM is oversized and he is reclaiming resources);
  6. All of our clients have the _BESClient_Report_MinimumInterval parameter set to 1800 (30 min);
  7. We disabled the AV and it is still slow (the BigFix services are excluded from the scan rules as well);
  8. It seems the bottleneck is that the FillDB threads are not able to handle all of our reports, as shown in the FillDBPerfLog below:

Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update: 5145 messages in 134971 ms: 38 messages/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – 0.16% full reports
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for Fixlet results: 7644 rows with per thread mean rate: 8308 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for action results: 3168 rows with per thread mean rate: 7764 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for bad report numbers: 1346 rows with per thread mean rate: 8059 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for computer administrators: 8 rows with per thread mean rate: 69 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for computer relay statuses: 11 rows with per thread mean rate: 211 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for computer roles: 1211 rows with per thread mean rate: 557 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for computer sequences: 3645 rows with per thread mean rate: 17694 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for computer sites subscription: 58 rows with per thread mean rate: 773 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for dynamic computer groups: 4014 rows with per thread mean rate: 18757 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for long property results: 1500 rows with per thread mean rate: 398 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 7088 – ----------- Parallel DB Update Summary for short property results: 11352 rows with per thread mean rate: 6968 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 8732 – short property results: 1220 rows in 206 ms: 5922 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 8604 – action results: 351 rows in 46 ms: 7630 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 3440 – short property results: 1523 rows in 215 ms: 7083 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 8732 – action results: 384 rows in 48 ms: 8000 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 – 3440 – action results: 515 rows in 59 ms: 8728 rows/sec
Fri, 18 Aug 2017 12:29:47 -0300 – 2880 – ----------- Parallel Parsing: 5004 messages in 109534 ms: 45 messages/sec

After upgrading from 9.5.4 to 9.5.5 we experienced significant performance issues due to the cardinality of our BES data. Through extensive troubleshooting with the IBM L2/L3 teams, we identified DB2 index and related issues that required implementing non-GA fixes. We’re not a Windows/SQL Server shop, so I’m not sure whether our situation is applicable given the environment details you provided.

Do you have any FillDBPerf.log data from before the upgrade with which you could compare similar updates and their respective rows/second?
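If you do, a small script can make the comparison easier. Below is a minimal sketch in Python (untested against your environment) that pulls the "messages/sec" summary rates out of one or more FillDBPerf.log copies; the regex simply matches the summary lines quoted above, so adjust it if your log format differs.

    # Minimal sketch: summarize "messages/sec" rates from FillDBPerf.log copies.
    # The regex matches summary lines like:
    #   "... Parallel DB Update: 5145 messages in 134971 ms: 38 messages/sec"
    import re
    import sys

    RATE_RE = re.compile(r"(\d+) messages in (\d+) ms: (\d+) messages/sec")

    def summarize(path):
        rates = []
        with open(path, errors="replace") as log:
            for line in log:
                match = RATE_RE.search(line)
                if match:
                    rates.append(int(match.group(3)))
        if rates:
            print(f"{path}: {len(rates)} batches, min {min(rates)}, "
                  f"max {max(rates)}, mean {sum(rates) / len(rates):.0f} messages/sec")

    if __name__ == "__main__":
        # e.g. python filldb_rates.py FillDBPerf.old.log FillDBPerf.new.log
        for log_path in sys.argv[1:]:
            summarize(log_path)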

Have you tried disabling the parallelism feature for BESFillDB and rechecking the rows/second?

IBM recommended that we keep parallelism at 3/3 threads for the BufferDir and Query channels, respectively.

Are you seeing any deadlocks and/or row/table locks at your database layer?

On Linux systems we would leverage NMON to monitor CPU utilization, memory consumption, storage I/O, etc. Have you checked an analogous system-monitoring solution on your Windows root BES server? Do you see increased latency at the storage and/or network I/O level?
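If it helps, here is a rough sketch of the kind of sampling NMON gives us, done in Python with the cross-platform psutil package (an assumption on my part that installing it is acceptable in your environment; the interval and output format are arbitrary):

    # Rough NMON-style sampler for the root BES server: CPU, memory, disk I/O.
    # Requires: pip install psutil
    import time
    import psutil

    INTERVAL = 30  # seconds between samples

    psutil.cpu_percent()                 # prime the counter; first call returns 0.0
    prev = psutil.disk_io_counters()
    while True:
        time.sleep(INTERVAL)
        cur = psutil.disk_io_counters()
        cpu = psutil.cpu_percent()       # average CPU % since the previous call
        mem = psutil.virtual_memory().percent
        read_mb = (cur.read_bytes - prev.read_bytes) / INTERVAL / 1e6
        write_mb = (cur.write_bytes - prev.write_bytes) / INTERVAL / 1e6
        print(f"cpu {cpu:5.1f}%  mem {mem:5.1f}%  "
              f"disk read {read_mb:6.1f} MB/s  write {write_mb:6.1f} MB/s")
        prev = cur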

Finally, have you enabled verbose logging for the BESRootServer service? Is anything jumping out from that log stream?

Thanks for your reply, Casey!!

We checked the Performance Monitor on Windows and everything is fine in both the OS and VMware (I/O, CPU, RAM). The DB is also working without issues.
However, I found an old FillDBPerfLog here, and I can see that before the upgrade from 9.5.5 to 9.5.6 FillDB was parsing about 500 messages/sec, and now it is only 45 messages/sec with the same infrastructure.
So, the question is: what changed in FillDB between BigFix 9.5.5 and 9.5.6 to decrease the parsing numbers?
We have created a PMR for IBM to check this issue as well.

Thanks for your help.

@Andrey,

Happy to help. :relaxed:

With those FillDBPerf log details, can you determine which table dropped from 500/sec to 45/sec? It may be necessary to review the indexes on that table. In our case we had to drop all indexes on the impacted tables, put new (non-GA) indexes in place, and then re-run stats to get our performance back to pre-upgrade levels.
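For what it’s worth, the stats step on DB2 can be scripted. This is only a sketch of that last step, with a placeholder connection string and table name (the replacement indexes themselves were non-GA and came from IBM support, so I can’t share those here):

    # Sketch: re-run statistics on a suspect table via the ibm_db driver.
    # Connection string and table name are placeholders; substitute your own.
    import ibm_db

    conn = ibm_db.connect(
        "DATABASE=BFENT;HOSTNAME=db-host;PORT=50000;PROTOCOL=TCPIP;"
        "UID=db2inst1;PWD=secret", "", "")

    # ADMIN_CMD lets RUNSTATS be issued through SQL instead of the DB2 CLP.
    ibm_db.exec_immediate(conn,
        "CALL SYSPROC.ADMIN_CMD("
        "'RUNSTATS ON TABLE MYSCHEMA.MYTABLE "
        "WITH DISTRIBUTION AND DETAILED INDEXES ALL')")
    ibm_db.close(conn)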

If you private message me I can give you the PMR number from our BES upgrade issues and DB2 reindexing efforts. It may or may not be helpful to L2/L3 with your PMR.

Best,
@cmcannady

Hi Casey,
Thanks again!!!
I double-checked with our DBA, and while he was verifying the DB health he rebuilt all of the indexes as well.
Actually, we now suspect the signature of reports delivered in recent BigFix versions, because as the logs below show, this issue is related to FillDB’s capacity to parse files, not to uploading data to SQL Server:

Before BigFix upgrade:

Wed, 26 Apr 2017 09:35:48 -0300 – 8112 – Replication: 26 ms
Wed, 26 Apr 2017 09:35:48 -0300 – 8112 – GetComputerSequences: 26 ms
Wed, 26 Apr 2017 09:35:48 -0300 – 8112 – DeleteUnusedSites: 23 ms
Wed, 26 Apr 2017 09:35:58 -0300 – 8112 – GetAnalysisPropertyMappings: 15 ms
Wed, 26 Apr 2017 09:35:58 -0300 – 8112 – GetStatisticalPropertyIDs: 102 ms
Wed, 26 Apr 2017 09:35:58 -0300 – 8112 – GetNonReportingFixlets: 37 ms
Wed, 26 Apr 2017 09:35:58 -0300 – 8112 – GetComputersWithCertificates: 560 ms
Wed, 26 Apr 2017 09:35:58 -0300 – 8112 – UpdateMappings complete in: 719 ms
Wed, 26 Apr 2017 09:36:00 -0300 – 8112 – Parsing: 716 messages in 1443 ms: 496 messages/sec
Wed, 26 Apr 2017 09:36:01 -0300 – 8112 – computer sequences: 716 rows in 576 ms: 1243 rows/sec
Wed, 26 Apr 2017 09:36:01 -0300 – 8112 – bad report numbers: 2 rows in 2 ms: 1000 rows/sec
Wed, 26 Apr 2017 09:36:01 -0300 – 8112 – Fixlet results: 45 rows in 40 ms: 1125 rows/sec
Wed, 26 Apr 2017 09:36:01 -0300 – 8112 – computer roles: 3 rows in 17 ms: 176 rows/sec
Wed, 26 Apr 2017 09:36:01 -0300 – 8112 – long property results: 51 rows in 161 ms: 316 rows/sec
Wed, 26 Apr 2017 09:36:02 -0300 – 8112 – short property results: 1845 rows in 1648 ms: 1119 rows/sec
Wed, 26 Apr 2017 09:36:02 -0300 – 8112 – non-sampled statistics: 22 rows in 83 ms: 265 rows/sec
Wed, 26 Apr 2017 09:36:02 -0300 – 8112 – action results: 37 rows in 35 ms: 1057 rows/sec
Wed, 26 Apr 2017 09:36:02 -0300 – 8112 – ----------- Batch Complete: 716 messages in 4734 ms: 151 messages/sec
Wed, 26 Apr 2017 09:36:02 -0300 – 8112 – 100.0% full reports

After BigFix upgrade:

Mon, 21 Aug 2017 07:30:48 -0300 – 7500 – ----------- Parallel DB Update (Short Batch): 17 messages in 121 ms: 140 messages/sec
Mon, 21 Aug 2017 07:30:48 -0300 – 7500 – 0.0% full reports
Mon, 21 Aug 2017 07:30:48 -0300 – 7500 – ----------- Parallel DB Update Summary for computer sequences: 17 rows with per thread mean rate: 346 rows/sec
Mon, 21 Aug 2017 07:30:48 -0300 – 7500 – ----------- Parallel DB Update Summary for long property results: 2 rows with per thread mean rate: 44 rows/sec
Mon, 21 Aug 2017 07:30:48 -0300 – 7500 – ----------- Parallel DB Update Summary for short property results: 31 rows with per thread mean rate: 418 rows/sec
Mon, 21 Aug 2017 07:30:48 -0300 – 8268 – GetComputersWithCertificates: 326 ms
Mon, 21 Aug 2017 07:30:48 -0300 – 8268 – UpdateMappings complete in: 426 ms
Mon, 21 Aug 2017 07:30:49 -0300 – 6636 – ----------- Parallel Parsing: 10 messages in 295 ms: 33 messages/sec

Wow, that is a significant performance impact. I just finished installing 9.5.6.63 on the infrastructure in our performance lab and am going to start running it through its paces. I’ll definitely be paying attention to the FillDBPerf.log.

Have you tried disabling the FillDB and Query parallelism configurations on your root BES server and rechecking the BESFillDB and FillDBPerf logs? I’m also curious whether you have DatabaseBoostLevel configured on your root BES server. Have you tried disabling DatabaseBoostLevel in favor of BESFillDB parallelism?
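If you want to double-check what’s actually configured, the FillDB values can be read straight from the registry. A minimal sketch, assuming the usual 64-bit registry location for the BES server (verify the path on your own installation):

    # Sketch: dump the FillDB tuning values discussed in this thread.
    # The registry path is an assumption based on a default 64-bit install.
    import winreg

    FILLDB_KEY = r"SOFTWARE\Wow6432Node\BigFix\Enterprise Server\FillDB"
    NAMES = ["ParallelismEnabled", "DatabaseBoostLevel",
             "NumberOfDBUpdatingThreads", "NumberOfParsingThreads"]

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FILLDB_KEY) as key:
        for name in NAMES:
            try:
                value, _ = winreg.QueryValueEx(key, name)
                print(f"{name} = {value}")
            except FileNotFoundError:
                print(f"{name} is not set")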

When we suspected this issue could be related to parallelism, we tried disabling it, but without success.
We have also enabled the DatabaseBoostLevel parameter, but that only increases upload performance to the DB and is not related to the parsing process.
I think we have to wait for IBM to reply to our PMR to try to figure out what is happening.


From the data you provided, it looks like your operators are sending refreshes to computers in the console, which is overwhelming your server. Your performance after the upgrade actually improved about 7x over before the upgrade, at least according to the initial log data you provided.

Before:

Wed, 26 Apr 2017 09:36:01 -0300 -- 8112 -- Fixlet results: 45 rows in 40 ms: 1125 rows/sec
Wed, 26 Apr 2017 09:36:01 -0300 -- 8112 -- long property results: 51 rows in 161 ms: 316 rows/sec
Wed, 26 Apr 2017 09:36:02 -0300 -- 8112 -- short property results: 1845 rows in 1648 ms: 1119 rows/sec
Wed, 26 Apr 2017 09:36:02 -0300 -- 8112 -- action results: 37 rows in 35 ms: 1057 rows/sec

After:

Fri, 18 Aug 2017 12:28:36 -0300 -- 3440 -- short property results: 1523 rows in 215 ms: 7083 rows/sec
Fri, 18 Aug 2017 12:28:36 -0300 -- 8732 -- action results: 384 rows in 48 ms: 8000 rows/sec

So data is moving into the database very quickly, but the problem is that you are trying to process something like 20x the normal amount of data. The “full reports” indicator in the log is actually inverted, so your pre-upgrade log’s “100.0% full reports” actually means 0%, and the post-upgrade log’s “0.16% full reports” actually indicates 99.84% full reports.

While it is natural for operators to try to refresh their endpoints when they are grayed out, it does create load on the system, and in this case they are probably doing it repeatedly on large sets of endpoints. So you need to get your operators to stop sending refreshes and give the system a chance to catch up. That is the only way out. A few releases back we added controls to prevent operators from sending more than one refresh at a time, or to prevent them from sending refreshes at all, to help prevent such situations.


Forgot to clarify: this is also why the message rate changed. A message is one report, so if your insertion rates (rows/sec) are the same or higher but your message rates (messages/sec) are lower, then your message (report) size is bigger. A standard client report usually has 4-10 rows of data to insert into the database, but a full report can have over 200 rows. So processing full reports from all endpoints is equivalent to having 20x more endpoints than normal. Even a very performant server cannot keep up with that load for an extended period.
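As a back-of-the-envelope check, using the rough row counts above against the rates quoted earlier in this thread:

    # Rough arithmetic for the report-size effect described above.
    rows_per_normal_report = 7     # "4-10 rows" for a standard report
    rows_per_full_report = 200     # "over 200 rows" for a full report

    # Pre-upgrade log: ~500 messages/sec, essentially all normal reports.
    print(500 * rows_per_normal_report)  # ~3500 rows/sec into the database

    # Post-upgrade log: ~45 messages/sec, ~99.8% full reports.
    print(45 * rows_per_full_report)     # ~9000 rows/sec into the database

So even though messages/sec dropped by a factor of 10, the database is actually inserting more rows per second than before; the backlog comes from the sheer volume of report data, not from slower insertion.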


@Andrey,

We performed our production BigFix upgrade from v9.5.5.193 to v9.5.6.63 overnight, and with the exception of some environmental issues (not BigFix related) it went well. Elapsed time was approximately 5 hours to upgrade the root BES server and all BESRelays across our enterprise (3,000+ sites).

As with all upgrades, I monitored BESRelay.log and FillDBPerf.log on the root BES server. FillDBPerf.log showed that parallelization and overall performance (i.e. rows/second) were good. BESRelay.log showed the usual errors after an upgrade, but nothing out of the ordinary. Overall, I’m very pleased with this upgrade so far from the infrastructure side. Next week we’ll start rolling the BESClient upgrades to our *NIX endpoints.

Best,
@cmcannady

May I ask the size of your database, the number of endpoints, and whether you elected to temporarily increase the client report time to 6 hours prior to the update (as is recommended, but I never do)?


We’re Linux/DB2, and BFENT is using roughly 185 GB of storage (lots of custom content, properties, etc.). The number of endpoints is roughly 100,000 across 3,000+ distinct sites. We always issue upgrade content so that endpoint reporting and fail-over are set to 6 hours.

Thanks. I guess you would want it at 6 hours for that many. We’re at 20 GB with about 8,000 endpoints. Maybe it’s not too important for me to make that change prior to our future update from 9.5.5 to 9.5.7.

It’s definitely a best practice for us. Our relay infrastructure (3,100+ dedicated servers) gets hit hard during upgrades if we don’t push reporting and fail-over to 6 hours.

With SOP (i.e. snapshots, backups, etc.) our upgrades can push past that 6-hour mark, and we don’t want clients failing over to our data centers. We do have “fake root” implemented, so the root BES server wouldn’t be impacted, but the client traffic wouldn’t be appreciated by our NOC.


The condition @steve is referring to is documented here:

http://www-01.ibm.com/support/docview.wss?uid=swg21688336

Along with advice and controls to help limit the behavior of your operators and prevent them from abusing the Send Refresh functionality.

An additional point about the Send Refresh functionality:

What does the Send Refresh on the right click menu of a computer do in the BigFix Console?

Great news, Casey!!!
I’m very happy to know your environment is working fine after the upgrade.
Since we have only 5 or 6 operators, we decided to change the _BESRelay_PostResults_ResultTimeLimit parameter to 5 minutes, and it seems to be working well.
Thanks all for your help!!


Just for info, we’ve started seeing FillDB issues too, following the upgrade to 9.5.6 (from 9.5.4). Our BufferDir was flowing freely under 9.5.4 and rarely peaked above 200 files; now we’re maxing out at 10,000 files with a 2-hour lag on data being committed to the DB. The perf log shows between 20-40 messages per second and a similar 0.xx% full reports, and neither of our 2 operators is sending full refresh requests. We have 110,000 endpoints and over 1,000 relays. The server is an 8-core Xeon with 64 GB RAM, with the DB on a RAID 5 SSD volume and the application on a RAID 10 HDD. Our DB is 55 GB in size. There have been no issues running this hardware: we started on a clean 9.1.1117.0 installation and have upgraded from 9.1 through 9.2 and 9.5 up to patch 4 without ever having any issues. The issue also seems to be impacting Web Reports; groups using SOAP calls to pull data via Web Reports are taking so long that it seems to cause the Web Reports service to crash. A PMR is open, so hopefully the cause can be found and fixed.
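While we wait on the PMR, we’re keeping an eye on the backlog with a small script that counts the BufferDir files over time (a minimal sketch; the path is the default on our Windows root server, so adjust it for your installation):

    # Sketch: log the FillDB BufferDir backlog once a minute.
    # The path below is the default install location; adjust as needed.
    import os
    import time

    BUFFER_DIR = (r"C:\Program Files (x86)\BigFix Enterprise"
                  r"\BES Server\FillDBData\BufferDir")

    while True:
        count = sum(len(files) for _, _, files in os.walk(BUFFER_DIR))
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')}  {count} files in BufferDir")
        time.sleep(60)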

We had contact from IBM, and after updating FillDB.exe to version 9.5.6.65 our FillDB performance is back to normal; our messages-per-second rate has multiplied tenfold and we now see 300+ messages per second being processed.

Top marks and a big thank you to IBM support for the speed of support and fixing the problem. :trophy:


Did the update of filldb.exe to v9.5.6.65 resolve the issue you were seeing with WebReports crashing? We are seeing something similar with WebReports on 9.5.4.38.