WebUI and FillDB bogging down

This was my thought as well. Likely the storage used on the WebUI server, where the WebUI cache is stored, is too slow. It could also be that the storage or performance of the database the root server uses is slow, among other things.

I don’t think the number of endpoints is related to the issue unless the majority of them are talking to the root server directly.

Did these “overloaded” relays get backed up with reports often? 4000 clients on a single relay should not be an issue if the relay is dedicated to being a relay and has fast enough storage / networking / etc. A lower number of clients per relay may help if the relay is getting backed up, but otherwise it will probably not help.

Related:

I never understood the concern with the Relays.

My concern was the stalled FillDB. Rebooting the server allows FillDB to process the waiting check-in files with no issues. Something is causing the FillDB service to stall out.

It doesn’t make sense if FillDB has a bunch of pending reports to process and they are not getting cleared out at all, with the same ones sticking around.

It might make more sense if FillDB is consistently backed up but still working through reports, with new ones coming in as fast as they are processed, but this doesn’t seem to be what you are describing.

If you have a bunch of overloaded relays, they could be sending up lots of reports and never getting through them all quickly enough, causing things to back up behind them. This effect could make it seem like FillDB is backed up as well when those relays send up lots of reports at once, though that still shouldn’t be an issue if FillDB is processing reports fast enough, other than for the clients connected to the affected relays.

The drives on the server are all SSD, and even with 43k endpoints, there are rarely more than 20-30 files in the BufferDir folder. Surges might get up to ~100.

When FillDB stalls, there will be 600-800 files waiting to be processed.

If I try to stop the FillDB service, it fails to stop. Rebooting the server clears whatever is causing the hang-up, and shortly after rebooting, the BufferDir folder clears out.

Short-term, I’m planning to solve the Relay issue by ordering some new hardware and deploying 10+ new dedicated Relays.

This is exactly the same as what I am seeing.
Typically a max of 20-30 files, with the occasional peak. Stopping FillDB fails. Stopping the WebUI service releases the buffer and the folder clears out, without the need for a server restart or a root server service restart.

While I don’t have SSDs, I do have fast drives on the server with lots of RAM and cores.

The symptoms you describe do not sound like storage IO issues, but I would say there is no such thing as a “fast” spinning drive. The maximum IOPS of a spinning drive is around 200 while NVMe SSDs are over 1000 times faster at 200000+ IOPS. Disk Raid and IOPS Calculator - Expedient
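
For a rough sense of scale, here is a back-of-the-envelope sketch of how long a backlog like the one described above would take to flush; the “IOs per report file” figure is purely an assumption for illustration, not a measured BigFix number:

```python
# Rough, illustrative estimate of how long a FillDB backlog takes to flush
# at different storage speeds. The IOs-per-file value is an assumption for
# illustration only, not a measured BigFix figure.

def flush_time_seconds(backlog_files, ios_per_file, iops):
    """Seconds to clear the backlog if storage IOPS were the only bottleneck."""
    return backlog_files * ios_per_file / iops

backlog = 800        # files seen in BufferDir during a stall (from this thread)
ios_per_file = 10    # assumed IOs to read, insert, and delete one report file

for label, iops in [("spinning disk (~200 IOPS)", 200),
                    ("NVMe SSD (~200,000 IOPS)", 200_000)]:
    print(f"{label}: ~{flush_time_seconds(backlog, ios_per_file, iops):.2f} s")
```

Either way, a few hundred buffered files should clear in well under a minute on healthy storage, which is another reason a multi-hour stall points away from raw disk speed.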

I’m starting to suspect Database conflicts. Something seems to lock a record on the server and FillDB doesn’t like it. WebUI?

One thing you may also want to look into is whether any of your operators are abusing the right-click Send Refresh functionality in the console; see the following article for an explanation and the settings to avoid the problem:

http://www-01.ibm.com/support/docview.wss?uid=swg21688336

In our setup, I deliberately turned off the right click to many…
I’m also not seeing any notify client forcerefresh actions…

Touch wood, no issues today… I’m also noticing that the loginTimeoutSeconds configuration doesn’t appear to be working for LDAP console operators, but it is for “local” operators. It did in the previous version.
It appears to also work for WebUI users… I have 23 Console users.

That seems likely. Not sure if the WebUI ETL would cause that, or if something else would.

The procedure we usually use to verify whether the WebUI is slowing down FillDB is to correlate the WebUI ETL log with the FillDB performance log. If you do not have the FillDB performance log enabled, I would suggest enabling it and waiting for the issue to reoccur.
FillDB writes very frequently in the performance log, unless it is stuck waiting for some lock to be released. So the procedure to troubleshoot the issue is (a rough log-correlation sketch follows the list):

  • as soon as you experience a report increase in the FillDB buffer directory, check the FillDB performance log and verify whether FillDB is writing logs or whether it is really stuck
  • check the WebUI ETL log and verify whether any data transfer is occurring between the WebUI and the server. You should see a line like:
    bf:bfetl:debug GET
    This means the WebUI has started requesting data and is processing it
  • wait for the WebUI to complete this ETL call. You should see in the log something like:
    bf:bfetl:debug Updated <table_name> <rows_number> rows in <seconds> seconds (<rows_per_second> rows per second)
  • check the FillDB performance log again: if FillDB has resumed writing logs soon after the ETL completed, it means that the WebUI ETL had locked some data and prevented FillDB from updating it.
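
A minimal sketch of that correlation, assuming both logs use timestamped lines; the file paths and the timestamp pattern below are placeholders you would adjust for your own installation:

```python
# Sketch: correlate silent gaps in the FillDB performance log with WebUI ETL
# activity. Paths and the timestamp pattern are placeholders; the two logs may
# also use different timestamp formats, so adjust the parsing for each file.
from datetime import datetime, timedelta
import re

FILLDB_PERF_LOG = r"C:\path\to\FillDB_performance.log"   # placeholder path
WEBUI_ETL_LOG = r"C:\path\to\webui_etl.log"               # placeholder path
TS_RE = re.compile(r"^(\w{3} \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2})")  # assumed format

def timestamped_lines(path):
    """Yield (timestamp, line) for every line that starts with a timestamp."""
    with open(path, errors="ignore") as f:
        for line in f:
            m = TS_RE.match(line)
            if m:
                yield datetime.strptime(m.group(1), "%a %d %b %Y %H:%M:%S"), line

# 1. Find silent gaps in the FillDB performance log (it normally writes often).
gaps, prev = [], None
for ts, _ in timestamped_lines(FILLDB_PERF_LOG):
    if prev is not None and ts - prev > timedelta(minutes=5):
        gaps.append((prev, ts))
    prev = ts

# 2. Check whether ETL activity ("bf:bfetl:debug" lines) overlaps those gaps.
etl = [(ts, line) for ts, line in timestamped_lines(WEBUI_ETL_LOG)
       if "bf:bfetl:debug" in line]
for start, end in gaps:
    hits = [line for ts, line in etl if start <= ts <= end]
    print(f"FillDB silent {start} -> {end}: {len(hits)} ETL log lines in that window")
```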

We have experienced FillDB slowdowns related to the WebUI ETL in the past, and we are working to solve them in a future update.

Ok… it’s bogged down again…
I’m going to open a PMR for this… I may have to shut down the WebUI totally.

Apparently there is an issue with ParallelismEnabled, particularly if the number of FillDB threads exceeds the CPU cores available, which could cause FillDB to stall.

I would try disabling it and look to enable it with more conservative thread counts once your issue is resolved.

How many CPU cores does your root server have? Does the root server have a local or remote database?
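
If it helps, here is a small sketch for sanity-checking the setting against the core count on a Windows root server; the registry path and value name are assumptions based on where FillDB settings are usually kept, so verify them on your own server first:

```python
# Sketch: compare the FillDB ParallelismEnabled setting against available cores.
# The registry path and value name below are assumptions; verify on your server.
import os
import winreg

FILLDB_KEY = r"SOFTWARE\Wow6432Node\BigFix\Enterprise Server\FillDB"  # assumed

def read_value(name):
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, FILLDB_KEY) as key:
            value, _ = winreg.QueryValueEx(key, name)
            return value
    except OSError:
        return None  # key or value not present

cores = os.cpu_count()
parallelism = read_value("ParallelismEnabled")  # value name assumed from this thread

print(f"Logical processors:  {cores}")
print(f"ParallelismEnabled:  {parallelism}")
if parallelism:
    print("If FillDB is configured with more worker threads than logical processors,")
    print("consider setting this value to 0 while troubleshooting the stalls.")
```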

@jgstew
The server has 2 processors with 16 cores and 32 logical processors… 32 GB RAM… local SQL 2014 database.

I have just changed the registry entry for ParallelismEnabled to 0.

I will restart all the services and see what happens.

Well, turning off ParallelismEnabled didn’t fix it… after about 6 hours of waiting for the WebUI to initialize, the FillDBData BufferDir filled up with 722 files and stopped. All 8K machines in the master console greyed out… Stopping the WebUI service then flushed the BufferDir and everything burst back into life.

It is typically easy to see the impact of parallelism through system monitoring. While disabling it is good to rule things out, it has been very good at being self-adapting.

In this case, I would suspect lock contention and a generally slow ETL process.
What does the Webui\ETL directory look like over time?
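
One low-effort way to answer that is to sample the directory contents on a schedule; here is a small sketch, with both paths as placeholders for wherever BufferDir and the WebUI ETL directory live on your server:

```python
# Sketch: record the file counts in the FillDB buffer and WebUI ETL directories
# over time, so a stall shows up as a sustained climb rather than a brief spike.
# Both paths are placeholders; point them at the real directories on your server.
import csv
import os
import time
from datetime import datetime

WATCHED_DIRS = {
    "BufferDir": r"C:\path\to\FillDBData\BufferDir",  # placeholder path
    "WebUI_ETL": r"C:\path\to\WebUI\ETL",             # placeholder path
}

with open("dir_counts.csv", "a", newline="") as out:
    writer = csv.writer(out)
    while True:                       # stop with Ctrl+C
        row = [datetime.now().isoformat(timespec="seconds")]
        for name, path in WATCHED_DIRS.items():
            try:
                row.append(len(os.listdir(path)))
            except OSError:
                row.append("n/a")     # directory missing or unreadable
        writer.writerow(row)
        out.flush()
        time.sleep(60)                # sample once a minute
```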

Thx.

I’m confused; are those the right places to configure parallelism? I thought they were regular client settings based upon this:

But I guess the parallelism settings are different, given that they go into besserver.config on Linux and not besclient.config, which I find a bit unusual. It kind of makes sense, but client settings are generally a universal way to configure BigFix across multiple OSes, so I tend to prefer them.

We see this today with 9.5.13. We kill and restart the FillDB and/or BES Root Server service on the root server. About 15 minutes after restarting, poof, the files process again.

I have opened a case with HCL.

Really tired of this behavior.

We’ve had this issue through many versions and cannot pin it down. The short version is that FillDB will hang; you can only kill it and then restart it. What we have found is that it almost always happens a few days after a reboot of the BES. If, after a restart of the BES, you later restart FillDB on your own, it seems to run fine.

Check the space available on your DSA DR setup target. Our DR DB was bloated to twice the size; we had a case open for over a month to find out why, and the vendor never came back with an answer. We could not keep enough free space and just kept adding more disk. We stopped and disabled the FillDB service on the DR targets, and the problem went away, along with a lot of other issues we were facing.

We will schedule a DR setup redo (drop DB) soon. Then we will watch the size again to see if there is a true defect. We may then just switch to another non-BigFix replication solution, of which there are a few.