WebUI and FillDB bogging down

I’m seeing an occasional issue where all the endpoints (8,000+) grey out in the console from time to time.
I have traced the problem to the FillDBData\Bufferdir folder filling up with 800+ entries and not clearing down.
Typically I see a maximum of 6 to 10 entries before they clear down and the cycle repeats, but this is becoming more troublesome with time.
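
One way to catch a backlog before the console greys out is to poll the buffer folder and compare the file count against the normal range. This is only a minimal Python sketch, not a BigFix tool: the function names and the threshold of 100 are assumptions for illustration, and the path passed in should be wherever your FillDBData\bufferdir folder lives.

```python
from pathlib import Path


def count_buffered_reports(buffer_dir):
    """Return the number of regular files waiting in the buffer folder."""
    return sum(1 for entry in Path(buffer_dir).iterdir() if entry.is_file())


def is_backed_up(count, threshold=100):
    """True when the backlog exceeds the alert threshold (assumed value)."""
    return count > threshold
```

Run from a scheduled task every few minutes, something like this would show whether the count is climbing steadily or jumping all at once.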

To stop the problem I used to just restart the BES Root Server Service, but I have found that it may be related to the BESWebui Service.
Restarting that instead of the Root Server also clears down the data.

The WebUI is hosted on the main server and not on the relays. I realise this isn’t the best combination.
Coupled with the fact that the SQL server is also on the same hardware, I’m thinking there is a performance issue somewhere, but why does it just choke and not restart?

I am seeing quite a lot of entries like
Fri, 28 Apr 2017 13:37:16 -0400 – 5524 – Encountered at least one error while processing archive file C:\Program Files (x86)\BigFix Enterprise\BES Server\FillDBData\bufferdir\00000000000d5532 in buffer; the archive was processed, but at least one part of it was discarded.
or
Unexpected exception encountered parsing file C:\Program Files (x86)\BigFix Enterprise\BES Server\FillDBData\bufferdir\00000000000b0d86; discarding: Client report signing certificate computer ID didn’t match computer ID of report. Discarding message from computer 2502848

Does anyone have any ideas?

What version are you running? I’ve been seeing similar things, that I think are related to processing a lot of files in the Uploads folder. With DSA replication the FillDB service is responsible for replicating the Uploads folder between servers.

We’re still on 9.5.4, and our PMR is on hold pending upgrade to 9.5.5. Support indicates that the multithreading in the 9.5.5 version of FillDB should help.

We’ve also been seeing FillDB_X.dmp files in the BES Server directory.

One of our DSA servers was additionally running Web Reports and Compliance. I moved those functions to another server last week to try to reduce the workload on the server, and so far that seems to have helped.

The WebUI ETL process can interfere with FillDB processing. This behavior is usually evident only when the WebUI is performing the initial ETL, that is, when it is completely recreating its SQLite database. During this process, the WebUI transfers large amounts of data from the Root Server, which keeps its database locked for the duration of the transfer.
For a deployment with a large number of endpoints, that time can be significant, usually hours.

From your description I think this is the root cause of your issue. To double-check, please look at the WebUI ETL.log file and check whether an ETL was running at the time you experienced the FillDB slowdown. If the analysis confirms that this is the issue and that the WebUI is executing the initial ETL, the only known workaround is to let the WebUI run the ETL at off-peak times, for example by starting the WebUI in the evening or letting it run over the weekend. Once the WebUI DB has been repopulated, you should no longer experience the issue (unless the WebUI decides to rebuild its database, which could happen after an upgrade).

Just as a side note, this locking behavior applies only to Windows/SQLServer while Linux/DB2 is not affected. We are working to solve this issue in a future patch.

The “signing certificate computer ID didn’t match” error seems to be unrelated to the slowdown issue. It appears to be a problem specific to computer 2502848, which appears to sign its reports with a certificate associated with a computer ID different from the one in the report. To diagnose it further, we would need to access the client certificates and check their content. If you have not yet opened a PMR, I would suggest opening one so that we can use it to exchange files.
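
To see whether the certificate mismatch really is limited to a single machine, the discard messages can be tallied per computer ID. This is a rough Python sketch: the pattern is taken from the log line quoted earlier in the thread, the function name is hypothetical, and it assumes the FillDB log has already been read in as text lines.

```python
import re
from collections import Counter

# Matches the tail of the error quoted in the original post.
DISCARD = re.compile(r"Discarding message from computer (\d+)")


def discarded_by_computer(lines):
    """Count discarded reports per computer ID found in the given log lines."""
    counts = Counter()
    for line in lines:
        match = DISCARD.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

If only one or two IDs show up, that would point to a client-side certificate problem rather than anything in FillDB itself.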

Jason, this is the latest 9.5.5…
Thanks bpastore… I will double-check the logs next time I see it happen…
It seems stable recently… (it must have heard me :wink:)

@bpastore.
Thanks Bernado
Well, it happened again.
FillDB reached 465 files and stopped…
If you send an action to a client, it performs the action, but gradually all the clients grey out after the 50-minute default setting we have…

This is the ETL log just before that point

Thu, 04 May 2017 15:42:55 GMT bf:bfetl:debug GET https://servername.xx.yy.zzz:52703/api/etl/computer-property-info?sequence=2027215547
Thu, 04 May 2017 15:45:08 GMT bf:bfetl:debug Failed to update COMPUTER_PROPERTY_INFO: Curl failed: Transferred a partial file
Thu, 04 May 2017 15:45:08 GMT bf:bfetl:debug GET https://servername.xx.yy.zzz:52703/api/etl/computer-property-text?sequence=2027215972

I swear this thing knows when I’m talking about it… It just freed up by itself…

Just curious: at what time did the ETL complete? For example, if the ETL took 60 minutes, that would explain why FillDB was stuck long enough for the agents to be marked as not reporting…

Here is the complete section of that log… it still hasn’t completed.

Thu, 04 May 2017 06:50:26 GMT bf:bfetl:debug Running cleanup
Thu, 04 May 2017 06:59:59 GMT bf:bfetl:debug Running analyze with a threshold of 1000
Thu, 04 May 2017 06:59:59 GMT bf:bfetl:debug Updating statistics on ACTIONS
Thu, 04 May 2017 07:00:04 GMT bf:bfetl:debug Updating statistics on ACTION_TARGET_STATIC
Thu, 04 May 2017 07:05:34 GMT bf:bfetl:debug Updating statistics on COMPUTER_ACTIONS
Thu, 04 May 2017 07:10:31 GMT bf:bfetl:debug Updating statistics on COMPUTER_ANALYSES
Thu, 04 May 2017 07:24:26 GMT bf:bfetl:debug Updating statistics on COMPUTER_GROUPS
Thu, 04 May 2017 07:25:25 GMT bf:bfetl:debug Updating statistics on COMPUTER_PROPERTY_INFO
Thu, 04 May 2017 08:07:23 GMT bf:bfetl:debug Updating statistics on COMPUTER_PROPERTY_TEXT
Thu, 04 May 2017 10:38:55 GMT bf:bfetl:debug Updating statistics on EXTERNAL_FIXLET_ACTIONS
Thu, 04 May 2017 10:39:12 GMT bf:bfetl:debug Updating statistics on EXTERNAL_FIXLET_ACTION_TRANSLATIONS
Thu, 04 May 2017 10:40:49 GMT bf:bfetl:debug Updating statistics on EXTERNAL_FIXLET_FIELDS
Thu, 04 May 2017 10:41:11 GMT bf:bfetl:debug Updating statistics on EXTERNAL_FIXLET_RELEVANCE
Thu, 04 May 2017 10:41:33 GMT bf:bfetl:debug Updating statistics on EXTERNAL_FIXLET_TRANSLATIONS
Thu, 04 May 2017 10:49:49 GMT bf:bfetl:debug Updating statistics on COMPUTERS
Thu, 04 May 2017 10:49:50 GMT bf:bfetl:debug Updating statistics on COMPUTER_BASELINES
Thu, 04 May 2017 10:51:45 GMT bf:bfetl:debug Updating statistics on COMPUTER_FIXLETS
Thu, 04 May 2017 13:56:20 GMT bf:bfetl:debug Updating statistics on COMPUTER_ROLES
Thu, 04 May 2017 13:56:51 GMT bf:bfetl:debug Updating statistics on COMPUTER_SITES
Thu, 04 May 2017 15:36:53 GMT bf:bfetl:debug Updating statistics on COMPUTER_USERS
Thu, 04 May 2017 15:38:09 GMT bf:bfetl:debug Updating statistics on EXTERNAL_FIXLETS
Thu, 04 May 2017 15:38:20 GMT bf:bfetl:debug Updating statistics on SITE_USERS
Thu, 04 May 2017 15:38:20 GMT bf:bfetl:debug Sending request for 66 tables
Thu, 04 May 2017 15:38:20 GMT bf:bfetl:debug Received request for 66 tables
Thu, 04 May 2017 15:38:21 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/actions?sequence=2027215536
Thu, 04 May 2017 15:39:41 GMT bf:bfetl:debug Updated ACTIONS 1939 rows in 79.712 seconds (24 rows per second)
Thu, 04 May 2017 15:40:55 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/action-fields?sequence=2027215537
Thu, 04 May 2017 15:40:57 GMT bf:bfetl:debug Updated ACTION_FIELDS 0 rows in 1.953 seconds
Thu, 04 May 2017 15:40:57 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/action-parameters?sequence=2027215537
Thu, 04 May 2017 15:40:59 GMT bf:bfetl:debug Updated ACTION_PARAMETERS 0 rows in 1.938 seconds
Thu, 04 May 2017 15:40:59 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/action-target-names?sequence=2027215537
Thu, 04 May 2017 15:41:01 GMT bf:bfetl:debug Updated ACTION_TARGET_NAMES 0 rows in 1.859 seconds
Thu, 04 May 2017 15:41:01 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/action-target-static?sequence=2027215537
Thu, 04 May 2017 15:41:26 GMT bf:bfetl:debug Updated ACTION_TARGET_STATIC 20985 rows in 25.491 seconds (823 rows per second)
Thu, 04 May 2017 15:41:27 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/action-user-groups?sequence=2027215540
Thu, 04 May 2017 15:41:29 GMT bf:bfetl:debug Updated ACTION_USER_GROUPS 0 rows in 1.907 seconds
Thu, 04 May 2017 15:41:29 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/analysis-activations?sequence=2027215540
Thu, 04 May 2017 15:41:29 GMT bf:bfetl:debug Updated ANALYSIS_ACTIVATIONS 0 rows in 0.047 seconds
Thu, 04 May 2017 15:41:29 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/computer-actions?sequence=2027215540
Thu, 04 May 2017 15:41:56 GMT bf:bfetl:debug Updated COMPUTER_ACTIONS 9779 rows in 27.033 seconds (361 rows per second)
Thu, 04 May 2017 15:41:56 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/computer-analyses?sequence=2027215540
Thu, 04 May 2017 15:42:37 GMT bf:bfetl:debug Updated COMPUTER_ANALYSES 8062 rows in 40.238 seconds (200 rows per second)
Thu, 04 May 2017 15:42:37 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/computer-groups?sequence=2027215547
Thu, 04 May 2017 15:42:54 GMT bf:bfetl:debug Updated COMPUTER_GROUPS 1518 rows in 17.314 seconds (87 rows per second)
Thu, 04 May 2017 15:42:55 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/computer-property-info?sequence=2027215547
Thu, 04 May 2017 15:45:08 GMT bf:bfetl:debug Failed to update COMPUTER_PROPERTY_INFO: Curl failed: Transferred a partial file
Thu, 04 May 2017 15:45:08 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/computer-property-text?sequence=2027215972
Thu, 04 May 2017 16:21:21 GMT bf:bfetl:debug Updated COMPUTER_PROPERTY_TEXT 963973 rows in 2172.744 seconds (443 rows per second)
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-analyses?sequence=2027216669
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug Updated CUSTOM_ANALYSES 0 rows in 0.031 seconds
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-analysis-fields?sequence=2027216669
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug Updated CUSTOM_ANALYSIS_FIELDS 0 rows in 0.063 seconds
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-analysis-properties?sequence=2027216677
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug Updated CUSTOM_ANALYSIS_PROPERTIES 0 rows in 0.031 seconds
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-analysis-relevance?sequence=2027216677
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug Updated CUSTOM_ANALYSIS_RELEVANCE 0 rows in 0.031 seconds
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-baseline-action-settings?sequence=2027216677
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug Updated CUSTOM_BASELINE_ACTION_SETTINGS 0 rows in 0.094 seconds
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-baseline-action-settings-user-groups?sequence=2027216677
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug Updated CUSTOM_BASELINE_ACTION_SETTINGS_USER_GROUPS 0 rows in 0.094 seconds
Thu, 04 May 2017 16:38:45 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-baseline-components?sequence=2027216677
Thu, 04 May 2017 16:38:51 GMT bf:bfetl:debug Updated CUSTOM_BASELINE_COMPONENTS 35 rows in 5.562 seconds (6 rows per second)
Thu, 04 May 2017 16:38:51 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-baseline-component-actions?sequence=2027216677
Thu, 04 May 2017 16:38:51 GMT bf:bfetl:debug Updated CUSTOM_BASELINE_COMPONENT_ACTIONS 35 rows in 0.609 seconds (57 rows per second)
Thu, 04 May 2017 16:38:51 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-baseline-component-action-success?sequence=2027216677
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug Updated CUSTOM_BASELINE_COMPONENT_ACTION_SUCCESS 35 rows in 0.235 seconds (148 rows per second)
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-baseline-component-groups?sequence=2027216677
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug Updated CUSTOM_BASELINE_COMPONENT_GROUPS 5 rows in 0.140 seconds (35 rows per second)
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-baseline-fields?sequence=2027216677
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug Updated CUSTOM_BASELINE_FIELDS 0 rows in 0.110 seconds
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-baseline-relevance?sequence=2027216677
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug Updated CUSTOM_BASELINE_RELEVANCE 4 rows in 0.250 seconds (16 rows per second)
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-fixlet-actions?sequence=2027216677
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug Updated CUSTOM_FIXLET_ACTIONS 2 rows in 0.156 seconds (12 rows per second)
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-fixlet-action-settings?sequence=2027216677
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug Updated CUSTOM_FIXLET_ACTION_SETTINGS 0 rows in 0.031 seconds
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-fixlet-action-settings-user-groups?sequence=2027216677
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug Updated CUSTOM_FIXLET_ACTION_SETTINGS_USER_GROUPS 0 rows in 0.032 seconds
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-fixlet-action-success?sequence=2027216677
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug Updated CUSTOM_FIXLET_ACTION_SUCCESS 1 rows in 0.062 seconds (16 rows per second)
Thu, 04 May 2017 16:38:52 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-fixlet-fields?sequence=2027216677
Thu, 04 May 2017 16:38:53 GMT bf:bfetl:debug Updated CUSTOM_FIXLET_FIELDS 1 rows in 0.235 seconds (4 rows per second)
Thu, 04 May 2017 16:38:53 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/custom-fixlet-relevance?sequence=2027216677
Thu, 04 May 2017 16:38:53 GMT bf:bfetl:debug Updated CUSTOM_FIXLET_RELEVANCE 2 rows in 0.093 seconds (21 rows per second)
Thu, 04 May 2017 16:38:53 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/dashboard-data?sequence=2027216677
Thu, 04 May 2017 16:38:54 GMT bf:bfetl:debug Updated DASHBOARD_DATA 13 rows in 1.813 seconds (7 rows per second)
Thu, 04 May 2017 16:38:54 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-analyses?sequence=2027216677
Thu, 04 May 2017 16:38:55 GMT bf:bfetl:debug Updated EXTERNAL_ANALYSES 23 rows in 0.062 seconds (370 rows per second)
Thu, 04 May 2017 16:38:55 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-analysis-fields?sequence=2027216677
Thu, 04 May 2017 16:38:55 GMT bf:bfetl:debug Updated EXTERNAL_ANALYSIS_FIELDS 51 rows in 0.125 seconds (408 rows per second)
Thu, 04 May 2017 16:38:55 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-analysis-properties?sequence=2027216677
Thu, 04 May 2017 16:38:55 GMT bf:bfetl:debug Updated EXTERNAL_ANALYSIS_PROPERTIES 58 rows in 0.297 seconds (195 rows per second)
Thu, 04 May 2017 16:38:55 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-analysis-property-translations?sequence=2027216677
Thu, 04 May 2017 16:38:55 GMT bf:bfetl:debug Updated EXTERNAL_ANALYSIS_PROPERTY_TRANSLATIONS 570 rows in 0.469 seconds (1215 rows per second)
Thu, 04 May 2017 16:38:55 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-analysis-relevance?sequence=2027216677
Thu, 04 May 2017 16:38:56 GMT bf:bfetl:debug Updated EXTERNAL_ANALYSIS_RELEVANCE 85 rows in 0.203 seconds (418 rows per second)
Thu, 04 May 2017 16:38:56 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-analysis-translations?sequence=2027216677
Thu, 04 May 2017 16:38:57 GMT bf:bfetl:debug Updated EXTERNAL_ANALYSIS_TRANSLATIONS 220 rows in 1.000 seconds (220 rows per second)
Thu, 04 May 2017 16:38:57 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-fixlet-actions?sequence=2027216677
Thu, 04 May 2017 16:41:14 GMT bf:bfetl:debug Updated EXTERNAL_FIXLET_ACTIONS 44309 rows in 137.735 seconds (321 rows per second)
Thu, 04 May 2017 16:41:27 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-fixlet-action-settings?sequence=2027216677
Thu, 04 May 2017 16:41:27 GMT bf:bfetl:debug Updated EXTERNAL_FIXLET_ACTION_SETTINGS 0 rows in 0.031 seconds
Thu, 04 May 2017 16:41:27 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-fixlet-action-settings-user-groups?sequence=2027216677
Thu, 04 May 2017 16:41:27 GMT bf:bfetl:debug Updated EXTERNAL_FIXLET_ACTION_SETTINGS_USER_GROUPS 0 rows in 0.032 seconds
Thu, 04 May 2017 16:41:27 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-fixlet-action-success?sequence=2027216677
Thu, 04 May 2017 16:41:27 GMT bf:bfetl:debug Updated EXTERNAL_FIXLET_ACTION_SUCCESS 0 rows in 0.031 seconds
Thu, 04 May 2017 16:41:27 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-fixlet-action-translations?sequence=2027216677
Thu, 04 May 2017 16:44:26 GMT bf:bfetl:debug Updated EXTERNAL_FIXLET_ACTION_TRANSLATIONS 321510 rows in 178.610 seconds (1800 rows per second)
Thu, 04 May 2017 16:44:33 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-fixlet-fields?sequence=2027216677
Thu, 04 May 2017 16:45:33 GMT bf:bfetl:debug Updated EXTERNAL_FIXLET_FIELDS 121340 rows in 60.475 seconds (2006 rows per second)
Thu, 04 May 2017 16:45:37 GMT bf:bfetl:debug GET https://servername.xx.vv.zzz:52703/api/etl/external-fixlet-relevance?sequence=2027216677
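
A log this long is easier to read once the "Updated ... rows in ... seconds" lines are ranked by duration. The snippet below is a small Python sketch that assumes the line format shown above; the function name and the `top` parameter are made up for illustration.

```python
import re

# Table name, row count and elapsed seconds from an "Updated ..." ETL line.
UPDATED = re.compile(r"Updated (\w+) (\d+) rows in ([\d.]+) seconds")


def slowest_tables(lines, top=3):
    """Return (table, rows, seconds) tuples for the slowest ETL steps."""
    steps = []
    for line in lines:
        match = UPDATED.search(line)
        if match:
            steps.append((match.group(1), int(match.group(2)), float(match.group(3))))
    return sorted(steps, key=lambda step: step[2], reverse=True)[:top]
```

Fed the log above, the COMPUTER_PROPERTY_TEXT step (963973 rows in 2172.744 seconds) comes out on top by a wide margin.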

I’m seeing similar behavior in my v9.5.4 environment. We have 45k endpoints and my WebUI is installed on a different server.

Occasionally, the server will just “stall”. There could be 450 - 700 client check in files in the BufferDir folder.

Sometimes it resolves itself after 45 minutes to an hour; other times I just reboot the server to keep information flowing.

When I opened a PMR, the first thing they asked me was whether I had any overloaded relays (I have two relays that are known for accepting way too many clients, around 4K each). I’m ordering new computers to act as additional Relays at the location that’s acting up.

I’m hoping that a scheduled upgrade to v9.5.5 with the ability to multi thread FillDB will help resolve the situation in combination with the new Relays. One thing to consider is that the Multi-Threaded features in FillDB are not active until you apply the “ParallelismEnabled” setting on the server and restart FillDB.

Tim… I’m on 9.5.5 with ParallelismEnabled set to 1, along with the other settings described in the article.
We have a much smaller deployment here, with 8K clients and 9 relays… One relay periodically gets over 2000 clients and another 2 have 1000+, so that may be related… I may bring up another relay in the same location as the 2K one and see if that helps.

The root cause of the FillDB slowdown could be the COMPUTER_PROPERTY_TEXT ETL step, which takes about 40 minutes to complete (963973 rows in 2172.744 seconds). During that time, FillDB is very likely stuck.
There are two anomalies here:

  1. the number of transferred rows is very high for a deployment with 8K endpoints
  2. the insertion rate (443 rows per second) is very low

As for the insertion rate: in our test environments (which use SSD disks) and in other customer environments, we have observed insertion rates of 35K to 40K rows per second, roughly 80 to 90 times higher than the 443 rows per second in your environment. Your insertion rate is unexpectedly low.
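
The comparison can be checked directly from the numbers in the log. This is just a worked calculation in Python: the row count and elapsed time come from the COMPUTER_PROPERTY_TEXT line, the reference rates are the 35K and 40K figures mentioned above, and the variable names are arbitrary.

```python
# Figures from the COMPUTER_PROPERTY_TEXT line in the ETL log.
rows = 963973
seconds = 2172.744

# Observed insertion rate in rows per second.
observed_rate = rows / seconds

# Reference rates quoted for SSD-backed environments (rows per second).
reference_rates = (35_000, 40_000)

# How many times faster each reference rate is than the observed one.
slowdown_factors = [ref / observed_rate for ref in reference_rates]

print(f"observed: {observed_rate:.0f} rows/s")
for ref, factor in zip(reference_rates, slowdown_factors):
    print(f"{ref} rows/s is about {factor:.0f}x the observed rate")
```

The gap is close to two orders of magnitude, which is far too large to explain by normal variation between servers.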

The high number of transferred rows (bullet 1) could be due to the fact that the WebUI takes a long time to complete an ETL cycle in your environment. As a result, every time the ETL runs, a lot of changes have occurred on the server and have to be replicated in the WebUI DB.

Again looking at the log I can see:

Thu, 04 May 2017 10:49:50 GMT bf:bfetl:debug Updating statistics on COMPUTER_BASELINES
Thu, 04 May 2017 10:51:45 GMT bf:bfetl:debug Updating statistics on COMPUTER_FIXLETS
Thu, 04 May 2017 13:56:20 GMT bf:bfetl:debug Updating statistics on COMPUTER_ROLES
Thu, 04 May 2017 13:56:51 GMT bf:bfetl:debug Updating statistics on COMPUTER_SITES

The update of the statistics on the COMPUTER_FIXLETS table takes more than 3 hours, and this is the main contributor to the long time required by the ETL cycle. While the cardinality of COMPUTER_FIXLETS depends on the number of computers and the number of fixlets, in a deployment with 8K computers I do not expect it to be very large. And, even in larger environments, we never experienced such a long time to update the statistics.
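
Since each "Updating statistics on ..." line marks the start of a step, the time spent on a table is the gap to the next such line. Here is a short Python sketch of that calculation; it assumes the timestamp format shown in the log, and the function name is hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# Timestamp and table name from an "Updating statistics on ..." line.
STAT = re.compile(
    r"^(\w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} GMT) \S+ "
    r"Updating statistics on (\w+)"
)


def statistics_durations(lines):
    """Return (table, timedelta) pairs; the last step has no end time and is omitted."""
    starts = []
    for line in lines:
        match = STAT.match(line.strip())
        if match:
            starts.append((match.group(2), parsedate_to_datetime(match.group(1))))
    return [
        (table, next_start - start)
        for (table, start), (_, next_start) in zip(starts, starts[1:])
    ]
```

For the four lines quoted above, this gives a little over three hours for COMPUTER_FIXLETS and under two minutes for each of the other two steps that have an end time.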

There can be many different causes for this low performance of the WebUI DB. The disk subsystem could be the root cause, but so could a high number of WebUI users (I am not sure how many operators use the WebUI in your environment).

I would suggest opening a PMR so that we can use it to collect the performance data we usually need to troubleshoot issues like this one, and so that we can help pinpoint the root cause of the slow insertion rate.

This was my thought as well. Likely the storage used on the WebUI server where the WebUI cache is stored is too slow. It could also be that the storage and performance on the side of the database the root server uses is slow, or other things.

I don’t think the number of endpoints is related to the issue unless the majority of them are talking to the root server directly.

Did these “overloaded” relays get backed up with reports often? 4000 clients on a single relay should not be an issue if the relay is dedicated to being a relay and has fast enough storage / networking / etc. A lower number of clients per relay may help if the relay is getting backed up, but otherwise it will probably not help.

I never understood the concern with the Relays.

My concern was the stalled FillDB. Rebooting the server allows FillDB to process the waiting check-in files with no issues. Something is causing the FillDB service to stall out.

It doesn’t make sense if FillDB has a bunch of pending reports to process and they are not getting cleared out at all, with the same ones sticking around.

It might make more sense if FillDB is consistently backed up, but still working through reports, just more come in as fast as they are processed, but this doesn’t seem to be what you are describing.

If you have a bunch of overloaded relays, they could be sending up lots of reports without ever getting through them quickly enough, causing things to back up behind them. When those relays send up lots of reports at once, this could make it seem like FillDB is backed up as well. Even so, it shouldn’t be an issue if FillDB is processing reports fast enough, other than for the clients connected to the problem relays.

The drives on the server are all SSD, and even with 43k endpoints, there are rarely more than 20-30 files in the BufferDir folder. Surges might get up to ~100.

When FillDB stalls, there will be 600-800 files waiting to be processed.

If I try to stop the FillDB service, it fails to stop. Rebooting the server clears whatever is causing the hang-up, and shortly after rebooting, the BufferDir folder clears out.

Short-term, I’m planning to solve the Relay issue by ordering some new hardware and deploying 10+ new dedicated Relays.

This is exactly the same as I am seeing.
Typically a max of 20-30 files, with the occasional peak. Stopping FillDB fails. Stopping the WebUI service releases the buffer and clears the folder without the need for a server restart or a Root Server service restart.

While I don’t have SSDs, I do have fast drives on the server, with lots of RAM and cores.

The symptoms you describe do not sound like storage IO issues, but I would say there is no such thing as a “fast” spinning drive. The maximum IOPS of a spinning drive is around 200 while NVMe SSDs are over 1000 times faster at 200000+ IOPS. Disk Raid and IOPS Calculator - Expedient

I’m starting to suspect Database conflicts. Something seems to lock a record on the server and FillDB doesn’t like it. WebUI?

One thing you may also want to look into is to check to see if any of your operators are abusing the right click Send Refresh functionality in the console, see the following article for an explanation and knowledge/settings to avoid the problem:

http://www-01.ibm.com/support/docview.wss?uid=swg21688336

In our setup, I deliberately turn off the right click to many…
I’m also not seeing any notify client forcerefresh actions…

Touch wood, no issues today… I’m also noticing that the loginTimeoutSeconds configuration doesn’t appear to be working for LDAP console operators, but it is for “local” operators. It did work in the previous version.
It also appears to work for WebUI users… I have 23 Console users.

That seems likely. Not sure if the WebUI ETL would cause that or not, or if something else would cause that.