BigFix Architecture/Performance

We have an environment that consists of the following (everything is hosted in Azure):
Windows
1 Root Server (Windows Server 2016 | SQL Server 2016 co-hosted)

  • Standard E32ads v5 (32 vcpus, 256 GiB memory)
  • CPU avg. 25-40% | Memory Avg. 40-60%
  • BFEnterprise 450GB
  • Disk latency 1.5-3 ms (Azure monitoring; Premium SSD with bursting enabled on the SQL data disk)
  • TempDB / Log DB / Data DB are all on separate disks

1 Console Server (Server 2016)
1 WebReports / WebUI Server (Server 2016 | SQL Server 2016)
BigFix v10.0.46.x for Root/Relays/Clients

Linux
3 Top Relays
7 Regional Relays
4 External Relays

140k clients (Windows/Mac/Linux flavors)

  • clients report back every 30 minutes
  • PeerNest enabled

Other

  • 30-40+ Console Operators
  • avg. 1800 actions
  • 55 analyses (300-400 properties)
  • automated cleanup of computers/actions/baselines (see the sketch after this list)
  • DB (BFEnterprise) Re-indexing (Daily)
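
For context, our cleanup automation follows roughly the pattern sketched below. This is a minimal illustration only; the hostname, credentials, and the "expired actions" selection are placeholders, not our actual script:

```python
# Minimal sketch (not our production script): delete expired actions
# through the BigFix REST API. "bigfix-root" and the credentials are
# placeholders.
import requests
import xml.etree.ElementTree as ET

ROOT = "https://bigfix-root:52311"   # hypothetical root server URL
AUTH = ("apioperator", "password")   # dedicated operator for automation

# Session relevance returning the IDs of expired actions.
relevance = 'ids of bes actions whose (state of it = "Expired")'
resp = requests.get(f"{ROOT}/api/query", params={"relevance": relevance},
                    auth=AUTH, verify=False)  # self-signed cert is common
resp.raise_for_status()

# /api/query answers come back as <Answer> elements in the BESAPI XML.
for answer in ET.fromstring(resp.content).iter("Answer"):
    requests.delete(f"{ROOT}/api/action/{answer.text}",
                    auth=AUTH, verify=False)
```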

We’re experiencing slow performance while working in the console: simple tasks like creating actions, building baselines, and checking on computers can take several minutes to complete, during which the console becomes non-responsive. This is very frustrating for the operators, and it is especially noticeable during our patching cycles (Patch Tuesday and the following week).
Some of the tasks can be done through the WebUI, but not everything, and operators are used to working in the console.

We’re looking at upgrading our root server hardware sometime later this year; the focus will be:

  • Premium SSD v2
  • Upgrade BigFix version to v11 - 11.02 or later

We’re also working with HCL BigFix Support and have followed best practices according to the Capacity Planning Guide. We have also shared data from the Performance Toolkit with HCL Support, and it shows that the main culprit is disk latency / disk queueing and waiting. HCL Support wants to see latency of 1 ms or less.

We want to understand how people have configured similar environments, specifically around the root server/SQL DB, and whether they also experience this slowness while working in the console. We have automated some of the tasks using the REST API (creating content, initial actions, groups), and for sure more can be done in that area, but we don’t want to spend time developing solutions if the console is readily available and can do all of this.
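
To give an idea of what that automation looks like, below is a minimal sketch of taking an existing Fixlet action through the REST API. The hostname, credentials, site name, Fixlet ID, and target are placeholders rather than our real values:

```python
# Minimal sketch: trigger an existing Fixlet's Action1 via the REST API.
# All names/IDs below are placeholders.
import requests

ROOT = "https://bigfix-root:52311"  # hypothetical root server URL

action_xml = """<?xml version="1.0" encoding="UTF-8"?>
<BES xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation="BES.xsd">
  <SourcedFixletAction>
    <SourceFixlet>
      <Sitename>Enterprise Security</Sitename>
      <FixletID>123456</FixletID>
      <Action>Action1</Action>
    </SourceFixlet>
    <Target>
      <ComputerName>EXAMPLE-HOST</ComputerName>
    </Target>
  </SourcedFixletAction>
</BES>"""

resp = requests.post(f"{ROOT}/api/actions", data=action_xml,
                     auth=("apioperator", "password"),  # placeholder operator
                     verify=False)  # root servers often use self-signed certs
resp.raise_for_status()
print(resp.text)  # BESAPI XML describing the newly created action
```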

We’d also like to know whether other companies (if willing to share) already have experience with these Premium SSD v2 disks and BigFix v11, for which we have been told console performance should be better (up to a 30% performance increase).

Any other tips/tricks are more than welcome.

Sounds like you’re pursuing the right avenues.

Our installation has always been in VMware within a local datacenter. Some years ago we split the database off to a remote, dedicated SQL Server to please our DBAs. (They had found evidence of locking when it was colocated on the same VM.) Within the VM infrastructure, the BigFix root and SQL Server VMs are co-located on the same physical host.

I would also suggest:

  • Interrogate all additional tools (AV, EDR, etc.) to validate that their exceptions are correct and being observed.
  • Prune the action list as needed.
  • Make sure operator roles only subscribe to the sites they need.
  • Make sure the machines where folks run the Console are sized appropriately.

Just to check, is this console slowness an all-the-time issue, or is it an occasional freeze after which your active window sometimes changes to another BigFix console window?

If it’s the latter, as far as I know it’s a known issue and has no real fix; it’s just how the console is. Making sure you are on a decently spec’d host, with a good DB connection, antivirus whitelisting for the app folders, and the console refresh set to 3+ minutes is the only combo that helps. Sometimes when I am in the console building stuff, I even set the refresh way out to, say, 20 minutes. I also find some Windows devices just run it better than others; of all things, our master server itself runs the console slower than pretty much any other device.

As for our size: under 10k devices, about 15 relays, and we actively have 5,000-150,000 total actions with 3k-15k parent actions in the console. I start seeing noticeable slowness in general once we cross 100,000-150,000 total actions; closer to 400k is where we actually see issues. And yes, I am fully aware it’s not good to have that many actions, we do not use BigFix correctly :slight_smile:

It’s a known issue, and v11 is being worked on to make the console more responsive. We use NVMe disks for the root and SQL servers, and it makes no difference to console response on v10.

Getting into baselines and group management is awful: 3-5 minutes to get a response, and we see “not responding” in the title of the window A LOT.

We are upgrading to v11 next month; until that is done and we get some updates to the console code, nothing will be changing. We might be getting some v10 updates for the console, but that remains to be seen.

Development does not want to backport the v11 changes to v10; it’s a lot of work, is what we hear.

Of course, what we really want is this: HCL Software - Sign In

Vote if you have not.

A lot of the suggestions were mentioned above, but to rehash them:

  1. How is Console caching configured and is it being used at all? We have used moderate caching for years and never had any issues.
  2. What is the console’s location compared to the root server? Essentially, what I mean is bandwidth. We had terrible console issues when we attempted to let operators in APAC, for example, connect to a root server in the US, because the bandwidth was just too slow to handle it. We solved the issue by putting all consoles on Citrix servers in the same location as the root server: operators just launch their Citrix session, which has the shared console app for them, and the traffic between the root server and consoles is instantaneous!
  3. AV & Network Traffic scanning software - both should have exclusions to NOT scan Console cache folder and/or Console cache traffic. I have seen AV scanners KILL console performance!
  4. Console usage - there are certain properties that may return a ton of data (especially some of the BFI default ones, but you may have custom ones too); make sure those are not something operators display in their consoles. Imagine you have a property that returns 100 MB of data per endpoint and a 15-second console refresh rate - that may just not be enough time to refresh it all, and the refresh can freeze the console until it completes. The way we have dealt with these: analyses that are not meant to be viewed in the console are activated with an NMO account, which only activates them “locally” - i.e. clients still see them and evaluate them, BUT other operators (apart from MOs) can’t see them in the console.
  5. Console refresh rate - well, you can reduce it. For example, we have it at 10 mins; if someone is really in a hurry they can click “refresh console” sooner, but in the general case operators would submit an action and come back to it, at which point it would have updated already anyway. As a rule of thumb: think about how much data your console is loading and whether it can realistically be expected to update at the refresh frequency you have set, assuming X concurrent console users, especially IF you don’t have caching enabled.
  6. One less common one, but still worth considering - do you know what people are running as REST API calls against your root server? I have seen badly written session relevance kill WebReports and root server performance! There have been some improvements added, so it really shouldn’t be happening, but it is still quite a sensitive thing imho… (see the sketch below for what I mean).
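
To make point 6 concrete, here is a rough sketch of a narrow session relevance query via the REST API; the hostname and credentials are placeholders. The narrow query asks only for what it needs, whereas something like `values of results of bes properties` would drag back everything and is the kind of query I mean:

```python
# Rough sketch: a narrow session relevance query via /api/query.
# Hostname and credentials are placeholders.
import requests

ROOT = "https://bigfix-root:52311"  # hypothetical root server URL

# Narrow: just names/IDs of computers that have not reported in 30+ days.
narrow = ('(name of it, id of it) of bes computers '
          'whose (now - last report time of it > 30 * day)')

resp = requests.get(f"{ROOT}/api/query", params={"relevance": narrow},
                    auth=("apioperator", "password"), verify=False)
resp.raise_for_status()
print(resp.text)  # BESAPI XML with one <Answer> per result tuple
```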

A few replies to your questions:
Console Caching → Keep partial cache on disk
Console location → same Azure region/zone as the Root server
AV & Network scanning → For AV, anything BigFix Enterprise has been excluded; we have also been told several times that our AV solution (CrowdStrike) does not do any file locking and has a different mechanism for checking files.
Console Usage → Not aware of properties collecting large amounts of data; we can probably do some housekeeping, as we’re no different from other companies where we like to create properties/analyses and forget to delete/clean them up when they’re not needed anymore.
Console Refresh Rate → If you’re referring to the “Refresh list every xxx seconds” setting, it’s currently set at 90 seconds, but I’ve changed it to 900 seconds to see if that makes any difference. These are user settings, so we can play around with them and then maybe have a BigFix policy action to set them for every console user.
REST API → I’m not so concerned here, as we have a few REST API calls that retrieve data from BigFix for our CMDB solution, and these are written according to best practices that BigFix experts have posted in the forum before.
We do have 1-2 SQL queries going for some PowerBI dashboards, but these are also written with good practices/standards (e.g., NOLOCK table hints, etc.).
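For illustration, those dashboard queries follow roughly this pattern; the server name and the exact table/column names here are from memory and illustrative, not our production queries:

```python
# Illustrative only -- not our production query. Shows the NOLOCK pattern
# our PowerBI extracts use; server and table/column names are from memory.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=bigfix-sql;DATABASE=BFEnterprise;Trusted_Connection=yes;"
)

# WITH (NOLOCK) accepts dirty reads so reporting never blocks root writes.
sql = """
SELECT ComputerID, LastReportTime
FROM dbo.COMPUTERS WITH (NOLOCK)
WHERE IsDeleted = 0;
"""

for computer_id, last_report in conn.cursor().execute(sql):
    print(computer_id, last_report)
```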
We also have the ServiceNow adapter configured, which reads data from ServiceNow and updates BigFix; it runs twice a week, and every run takes 1-2 hours to complete.

What I’m mainly looking for is that when we upgrade our hardware later this year and also upgrade to v11, we see some overall improvement in console performance; hence this post, to gather some more best practices.

Thanks to you all for responding with tips/tricks/advice, much appreciated.

About console refresh: there is an advanced option you can use to set a minimum refresh interval for all consoles: MinimumRefreshSeconds


You seem to have covered the basics, then; the next step would be to open a support case. Most likely support will ask you to enable console debug logging, wait for one or a few of those freezes/slow periods, and see what the console is doing at the time and what is delaying it.
