The root cause of the FillDB slowdown could be this specific ETL step, which takes about 40 minutes to complete. During these 40 minutes, FillDB is very likely stuck.
There are two anomalies here:
- the number of transferred rows is very high for a deployment with 8K endpoints
- the insertion rate (443 rows per second) is very low
About the insertion rate: in our test environments (which use SSD disks) and in other customer environments, we observed insertion rates of 35K to 40K rows per second, roughly 80 to 90 times higher than the rate in your environment. Your insertion rate is unexpectedly low.
The high number of transferred rows (bullet 1) could be due to the fact that the WebUI takes a long time to complete an ETL cycle in your environment. As a result, every time the ETL runs, a lot of changes have occurred on the server and have to be replicated in the WebUI DB.
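To put numbers on the gap, here is a minimal Python sketch that compares the observed rate with the reference rates quoted above. The function names are hypothetical, and the implied row count assumes the 443 rows per second rate held constant for the whole 40-minute step, which is only an approximation.

```python
# Figures taken from the discussion above.
OBSERVED_RATE = 443                # rows per second in the affected environment
REFERENCE_RATES = (35_000, 40_000) # rows per second seen in other environments
STEP_SECONDS = 40 * 60             # the ETL step takes about 40 minutes


def slowdown_factor(reference_rate, observed_rate=OBSERVED_RATE):
    """How many times faster the reference rate is than the observed one."""
    return reference_rate / observed_rate


def implied_rows(rate=OBSERVED_RATE, seconds=STEP_SECONDS):
    """Rows moved during the step, ASSUMING a constant rate (approximation)."""
    return rate * seconds


def seconds_at_rate(rows, rate):
    """Time the same transfer would take at a different insertion rate."""
    return rows / rate


if __name__ == "__main__":
    rows = implied_rows()
    print(f"Implied rows transferred: {rows}")
    for ref in REFERENCE_RATES:
        print(
            f"At {ref} rows/s: {slowdown_factor(ref):.1f}x faster, "
            f"step would take about {seconds_at_rate(rows, ref):.0f} s"
        )
```

Under these assumptions the step moves roughly one million rows, and at the reference rates the same transfer would finish in about half a minute instead of 40 minutes.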
Looking at the log again, I can see:
Thu, 04 May 2017 10:49:50 GMT bf:bfetl:debug Updating statistics on COMPUTER_BASELINES
Thu, 04 May 2017 10:51:45 GMT bf:bfetl:debug Updating statistics on COMPUTER_FIXLETS
Thu, 04 May 2017 13:56:20 GMT bf:bfetl:debug Updating statistics on COMPUTER_ROLES
Thu, 04 May 2017 13:56:51 GMT bf:bfetl:debug Updating statistics on COMPUTER_SITES
The update of the statistics on the COMPUTER_FIXLETS table takes more than 3 hours (from 10:51:45 to 13:56:20), and this is the main contributor to the long time required by the ETL cycle. While the cardinality of COMPUTER_FIXLETS depends on the number of computers and the number of fixlets, in a deployment with 8K computers I do not expect it to be very large. Even in larger environments, we have never seen the statistics update take this long.
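The per-table timings can be derived from the log timestamps with a short script. This is a minimal sketch, not part of any BigFix tooling: it assumes each "Updating statistics" line is written when that table's update starts, so the gap to the next line is attributed to the table named in the earlier line. The helper names are hypothetical, and the timestamp parsing relies on English day and month abbreviations.

```python
from datetime import datetime

# The four log lines quoted above, copied verbatim.
LOG_LINES = [
    "Thu, 04 May 2017 10:49:50 GMT bf:bfetl:debug Updating statistics on COMPUTER_BASELINES",
    "Thu, 04 May 2017 10:51:45 GMT bf:bfetl:debug Updating statistics on COMPUTER_FIXLETS",
    "Thu, 04 May 2017 13:56:20 GMT bf:bfetl:debug Updating statistics on COMPUTER_ROLES",
    "Thu, 04 May 2017 13:56:51 GMT bf:bfetl:debug Updating statistics on COMPUTER_SITES",
]


def parse_line(line):
    """Return (timestamp, table_name) for one 'Updating statistics' line."""
    stamp, _, rest = line.partition(" GMT ")
    when = datetime.strptime(stamp, "%a, %d %b %Y %H:%M:%S")
    table = rest.rsplit(" ", 1)[-1]  # the table name is the last word
    return when, table


def step_durations(lines):
    """Pair each table with the time elapsed until the next log line.

    The last line has no successor, so its duration cannot be computed.
    """
    entries = [parse_line(line) for line in lines]
    return [
        (table, next_when - when)
        for (when, table), (next_when, _) in zip(entries, entries[1:])
    ]


if __name__ == "__main__":
    for table, elapsed in step_durations(LOG_LINES):
        print(f"{table}: {elapsed}")
```

Run against the lines above, this attributes 3:04:35 to COMPUTER_FIXLETS, compared with 0:01:55 for COMPUTER_BASELINES and 0:00:31 for COMPUTER_ROLES. The same approach applied to the full ETL log would show whether any other step stands out.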
There can be many different causes for this low performance of the WebUI DB. The disk subsystem could be the root cause, but so could a high number of WebUI users (I am not sure how many operators use the WebUI in your environment).
I would suggest opening a PMR so that we can use it to collect the performance data we usually need to troubleshoot issues like this one, and so that we can help pinpoint the root cause of the slow insertion rate.