Local & RemoteApps Console Performance Issues

We have 120 console operators (on v. 9.1.1117.0) and climbing. Our global infrastructure is 48k machines and 130 relays… The console has always been a little sluggish and it seems it is becoming even more so. There are probably never more than 30 console operators connected at any given time. We’re also planning an upgrade to 9.2.7 shortly.

Master server:
HP ProLiant DL380p Gen8 with 64GB of RAM and 2 x 3.3GHz processors (with 4 cores each)

RemoteApps server:
VM with 12GB of RAM and a 2.8GHz processor (with 12 cores)

  • What do we do in a situation where we simply have to add more console operators… is our hardware good enough or do we need to bump it up and/or add a second RemoteApps server and try to load balance?

  • Our console operators connect to the RemoteApps server which is on the same switch as the master server… However I use the local console on the master server (or one of our DSA servers)… would this have a measurable, detrimental impact?

Any and all advice welcome :smile:

1 Like

Hello!

I would start by looking at the current performance of your infrastructure – if you look at the utilized resources on your master server how is it doing? Do you have >50% CPU usage? >70% RAM usage?

In my experience, the best thing you can do to speed up your root server (and related connections) is to improve disk performance. If you can move your database to a set of NVMe (or SAS/SATA) SSDs in RAID 1 (or RAID 10), you will have the fastest BigFix server in town for a relatively low price (<$5000). Normally your server is limited by disk, and it shows in console performance.

Next I’d make sure I have a 10GbE connection between my root, top-level relay, and console server. The top-level relay should be the only thing connecting directly to the BigFix server – this will have a nice performance impact!

If you’re using Encryption (check in the BES Admin tool) I’d move the encryption key to the top level relay so that the root server isn’t performing the decryption with its own CPU cycles.

Also, given that my consoles use about 1 GB of memory each when open, you will probably need to bump up the RAM on your RemoteApps server by an order of magnitude.
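As a rough sanity check, the sizing implied by those numbers can be worked out directly. A minimal back-of-envelope sketch; the OS/RDS overhead figure is an assumption, not a measured value:

```python
# Back-of-envelope RAM sizing for the RemoteApps server, using the
# ~1 GB-per-open-console figure quoted above.
concurrent_operators = 30   # peak concurrent console operators from the thread
gb_per_console = 1.0        # rough per-console figure quoted above
os_overhead_gb = 4.0        # ASSUMED headroom for Windows + RDS itself

needed_gb = concurrent_operators * gb_per_console + os_overhead_gb
print(needed_gb)  # 34.0 GB needed, vs the 12 GB currently provisioned
```

Even with a generous error bar on the per-console figure, the current 12 GB VM is well short of what 30 concurrent operators would need.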

I’d raise the time between refreshes in the console (log in as a master operator, File > Preferences) to at least a minute or two.

Finally, I’d institute a timeout on user sessions on the remote app server and in the BigFix Administration Tool allowing someone to stay idle for at most ~30 minutes. You can have 300 operators seem like 30 if you can control the number of concurrent operators you’ve got.

2 Likes

Have you taken a look at the BigFix WebUI feature?

Maybe you could grant some users access to the WebUI and leave the console just for those users that cannot perform their activities through the WebUI.

Hey @strawgate,

Our resources are under pressure. Memory is at least 90% at any given time, and CPU from 40-70% generally, although it does spike higher.

Our console refresh rate is 60 seconds (minimum) and we’re not using encryption. I have more RAM waiting to be installed in our root server; it’s just finding someone local to actually do it. But that should happen soon.

I’ll definitely look into SSDs and verify our root and console server connectivity speed.

I think the main part of our problem is that while we have a lot of relays, we also have business units without relays that point directly to our root/master server. The problem is, it’s hard to tell how many, since we have thousands of clients with a modified hosts file translating the root server/masthead hostname (which they cannot resolve) to a local relay IP (which they can reach in their own environment). Unfortunately, it seems the BigFix console cannot tell the difference between machines with this hacked configuration and machines that are actually pointing directly to the root server.
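One way to spot the hacked clients is to scan their hosts files for entries naming the masthead hostname. A minimal sketch; the `bfxmaster` alias, sample IP, and `hosts_overrides` helper are illustrative, and in practice you would feed this the contents of each client’s hosts file (e.g. via an analysis):

```python
def hosts_overrides(hosts_text, masthead="bfxmaster"):
    """Return (ip, hostnames) pairs from a hosts file that mention the
    masthead hostname, i.e. clients redirecting it to a local relay."""
    overrides = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        parts = line.split()
        ip, names = parts[0], parts[1:]
        if any(masthead.lower() in name.lower() for name in names):
            overrides.append((ip, names))
    return overrides

# Hypothetical hosts file content for demonstration.
sample = """# local relay redirect
10.1.2.3  bfxmaster.example.com bfxmaster
"""
print(hosts_overrides(sample))
# [('10.1.2.3', ['bfxmaster.example.com', 'bfxmaster'])]
```

Grouping the flagged clients by the IP they redirect to would also tell you which relay each “hacked” population is actually using.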

I have been interested for some time in adding a top-level relay (which we currently do not have), but am not sure how best to accomplish this, as I have received differing advice. After several years, our root server IP is finally known to pretty much all of our global firewalls, and access is granted over the required port. Requesting access for a new (top-level relay) IP would be a job and a half, as a lot of our firewalls are individually maintained. Machines that cannot resolve our root server DNS alias either add an internet-facing failover relay setting (if they have internet access) or modify their hosts file. So if I want to “promote” one of our relays to a top-level relay:

  • Can I just steal the IP from the root server, and then re-IP the root server to an unknown IP?
  • Should I associate the bfxmaster (root) DNS alias with the new top-level relay also? (And just leave the root server accessible to the top-level relay only?)
  • We currently have SQL on our root server too… is it ok to leave it there?

I’m sure like most places, our environment is complex and big changes are not easy to make. Any advice regarding this specifically would be appreciated as I am still fairly new to BigFix administration.

@fermt - we are on BigFix 9.1 so we do not currently have the web console. We should be upgrading to 9.2.7 soon so would be able to implement WebUI 1.0… not sure how good it is compared to 9.5.1.

Thanks all.

Thanks a million @strawgate… this has really helped clarify the top-level relay options. I think I prefer the fake root option: although pointing all relays to a top-level relay one by one would (eventually) work, it would not eliminate the clients that have no access to local relays and are connecting directly to the root.

I’ve created an analysis of the manual hosts file entries, so we’ll see what it comes back with… A mess, I imagine… :slight_smile:

Currently, our client heartbeat is every 20 minutes. I’ve increased the console refresh rate from 60 to 90 seconds, and will continue to increase bit by bit (if there is no uproar) until we’re at 3 minutes. The RAM, and better hardware - I’m working on. It’s slow going though!

There are some settings that we “push once” in our environment, i.e.:

  • Set BESClient_RelaySelect_FailoverRelay (to our internet relay)
  • Set CommandPollIntervalSeconds = 1200
  • Set MinimumAnalysisInterval = 300

There has been quite a shift in our organisation recently, with a lot more people working remotely (as I do). When I push an action to one of my test laptops here, it can take up to an hour for the action to run, probably because we now need more DMZ/internet-facing relays (which I am in the process of requesting). I’m worried that increasing the command polling and minimum report intervals will cause actions to take even longer… Am I picking this up wrong?

This is way too low for your environment – this should be set to between 2 and 12 hours. Polling every 20 minutes is fine for a set of test devices, but it’s far too frequent for the rest of the environment, especially with devices talking directly to the root server.

This won’t really help your issue. In BigFix there are two ways a client finds out about an action:

UDP Pings

A TCP message goes from the root server to the relay, from the relay to any other relays in line.
The parent relay of the client sends a UDP message to the client notifying it of new content.

If you are using UDP Pings then clients will find out about a new action and start processing in <1 second. This is what you want.

Verify that TCP 52311 messages can go from Root → Relay → Relay → Parent Relay of Client
Verify that UDP 52311 messages can go from Parent Relay of Client → Client
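The two port checks above can be sketched with plain sockets. A minimal sketch using only the standard library; note that for UDP a successful send proves nothing about delivery, so receipt still has to be confirmed on the client side (e.g. in the client’s log):

```python
import socket

def tcp_open(host, port=52311, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def send_udp_probe(host, port=52311):
    """Fire a UDP datagram at host:port. UDP is connectionless, so a
    successful send only proves the packet left this machine; confirm
    receipt on the far end."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(b"probe", (host, port))
```

Running `tcp_open("relay1.example.com")` from the root server (hostname is a placeholder) checks the first TCP hop; repeating it from each relay toward its children covers the rest of the chain.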

An important thing to note is this traffic is already happening in your environment every time you change something. It just happens that you’re blocking it at some point. Fixing this won’t increase traffic, it’ll just stop denies and make your clients more responsive!

If UDP pings are blocked AT ANY POINT you end up with command polling.

Command Polling

The client checks at a pre-determined interval for new content (in your case every 20 minutes). It makes this request regardless of whether there is new content or not.

This is what you’re seeing.
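Rough arithmetic on the numbers in this thread shows both the load command polling generates and the delay it imposes:

```python
# Figures from this thread: 48,000 clients, CommandPollIntervalSeconds = 1200.
clients = 48_000
poll_interval_s = 20 * 60

# Steady gather load (in practice mostly absorbed by relays; the clients
# pointing straight at the root are the painful ones).
polls_per_sec = clients / poll_interval_s
print(polls_per_sec)   # 40.0 requests per second, around the clock

# With polling alone, a client notices a new action after half the
# interval on average, and up to the full interval in the worst case.
avg_wait_min = poll_interval_s / 2 / 60
print(avg_wait_min)    # 10.0 minutes average wait
```

This is why working UDP pings matter: they replace a constant 20-minute polling cadence with sub-second notification.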

Your best bet will be to set up a relay at the site where you work and make sure it has TCP 52311 inbound allowed from the root server, and UDP 52311 allowed from the relay to your test clients. Once you do this you’ll be amazed at how fast your clients respond. For actions without downloads we sometimes see response times of under one second for the client to start processing the content.

Can you let me know what the “_BESClient_Report_MinimumInterval” client setting is in your environment? You should consider changing it to >90s.

1 Like

Hmmm, it looks like the “_BESClient_Report_MinimumInterval” client setting has never been set in our environment. I’ll get approval and set it to 120s.

OK, so I need to work on verifying we don’t have any TCP/UDP access issues in our environment (and fixing them if we do) rather than worrying about adding extra relays. I don’t see any fixlets or analyses offhand that can help with that. Is there a better way than manually checking TCP access from the root to each of the relays (e.g. via telnet), which I presume has no issues, and then asking the relay owners to check UDP from the parent relays (which I have no access to) to the child clients? I understand that sending a refresh will only succeed if UDP is open, but that’s a pretty manual check too. I could always ask our networking teams to verify that UDP on that port is open in our global firewall infrastructure, but as I mentioned previously, we have a lot of local/resource firewalls as well.

Thanks again for all of your assistance, and apologies if the above query is something obvious!

Links to the performance and capacity paper are in this post.

2 Likes

This analysis may help: https://bigfix.me/analysis/details/2998021

Specifically, the “client - last UDP ping - universal” property.

That property will be the last time a client received a ping.

It’ll be a manual process to verify, but what I would do is:

  • Look at a test client on a network.
  • If it gets UDP pings, you’re done!
  • If it doesn’t, look at its relay in BigFix. Does the relay get pings?
  • If it does, check the UDP path between the relay and the client.
  • If it doesn’t, look at its parent relay in BigFix. Does it get pings?
  • If it does, check TCP 52311 between the relays.
  • If it doesn’t, keep going up until you hit a relay that does, or you hit the root.

You can also test the relays going down from the root:

  • Start at the root and try to access http://relay:52311/Rd on the child relay. If that works, TCP 52311 is open; if it doesn’t, fix it.
  • Go to that relay and try to access http://childrelay:52311/Rd. If that works, TCP 52311 is open; if it doesn’t, fix it.
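That top-down walk can be scripted. A minimal sketch using only the standard library; the `first_broken_hop` helper and hostnames are illustrative, and any HTTP response, even an error status, is treated as proof that the port is open:

```python
import urllib.error
import urllib.request

def rd_reachable(relay, port=52311, timeout=5):
    """True if the relay's diagnostics (Rd) page answers at all.
    A timeout or connection error suggests a firewall block."""
    try:
        urllib.request.urlopen(f"http://{relay}:{port}/rd", timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # got an HTTP reply, so TCP 52311 is open
    except (urllib.error.URLError, OSError):
        return False

def first_broken_hop(chain, port=52311):
    """Walk a top-down relay chain and return the first relay whose Rd
    page does not answer, or None if the whole chain is reachable."""
    for relay in chain:
        if not rd_reachable(relay, port):
            return relay
    return None
```

Run from the root (or from each relay in turn), something like `first_broken_hop(["relay1.example.com", "relay2.example.com"])` pinpoints where in the chain the firewall fix is needed.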

1 Like

Thanks @gearoid! I’ll download this and have a read of it.

@strawgate: so it would make sense, for the most part, that I can narrow my testing down to clients that have never received a UDP ping, right? Then group them by relay and start checking from there?

The http://relay:port/rd pages are really useful! The interesting thing is, from my spot checks from the root server, several of our relays just return “This page can’t be displayed”, which is a little worrying. (No proxy issues or anything.) Yet, from my own laptop, those pages load.

Similarly, I presume I should be able to telnet from the root server (over the BigFix port) to all relays, right?

1 Like

You’ll be able to reach it from your laptop because that’s how the BESClient communicates!

That means the firewalls allow traffic from clients -> relays -> server.

The traffic going the other way server -> relays -> client is what we need for UDP pings so if your clients are not getting UDP pings and you can’t access that Rd page from the root server, there is your answer :slight_smile:

I don’t typically use telnet but I’d imagine it would work!

I’m not surprised that I can reach it from my laptop, but I’m really rather surprised I can’t reach the relay pages from the root! I reckoned that root to each relay, at least, would be set up as required. This is good to know, but not what I wanted to hear :wink:

Looks like I have some work ahead of me, but at least now I have a strategy… Time to get the whip out. I really do appreciate all the input though!

1 Like

That would depend on whether you have relays in a hierarchy or not. For example, you could have relays that are only addressable from the relay above them in the hierarchy, and not from the root.

1 Like

@gearoid No hierarchy…

What we frequently do is associate a small subnet with the BigFix infrastructure (a /27 or so) and allow that subnet through the firewalls. This makes it easier if you need to set up failover, a top-level relay, etc., as the whole subnet has 52311 open instead of just one magic IP.
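For a sense of how much room a /27 gives you, the standard library can do the math; the subnet address here is purely illustrative:

```python
import ipaddress

# Hypothetical /27 reserved for BigFix infrastructure.
infra = ipaddress.ip_network("10.20.30.0/27")

print(infra.num_addresses)        # 32 addresses in a /27
print(len(list(infra.hosts())))   # 30 usable (network + broadcast excluded)

# One firewall rule for the /27 then covers the root, a future
# top-level relay, failover relays, etc.
print(ipaddress.ip_address("10.20.30.5") in infra)  # True
```

Thirty usable addresses is plenty for a root, a top-level relay, and a handful of failover relays, while keeping the firewall change to a single rule.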

1 Like

Yeah, I believe we have something similar but need to check. I know there aren’t many free IP addresses around that range though… Fingers crossed we have the range open and a free IP sitting waiting for me! :slight_smile:

I’m surprised you’re having such difficulty with only 30 concurrent operators. I would definitely recommend upgrading the FillDB and SQL database storage to NVMe SSDs (the Intel P3700 is an example).

Make sure you have good, ongoing backups.

The more concurrent operators you have, the more issues you will have with console performance.

I think given your environment, it makes sense to use the Fake Root option.

I agree with @strawgate that this would be too aggressive. I would set command polling on all systems to something like every 3 to 6 hours, but set it to 1 to 2 hours for systems not getting UDP. I would also work on getting UDP to work as much as possible, which you are already doing.

See this example: https://bigfix.me/fixlet/details/3798

It turns out my relevance isn’t quite right for detecting UDP (relevance #3). It should be the NOT of this: https://bigfix.me/relevance/details/3017512

See this example: https://bigfix.me/analysis/details/2994561

This all makes sense. I had the RAM on our RemoteApps VM bumped from 12GB to 16GB, and that has actually helped, but as already pointed out, it’s nowhere near enough. We are not being allocated any more RAM due to resource shortages in the VI (which I don’t manage), so I think I need to look for a RemoteApps server with more suitable specs. Much more of a struggle than it should be, unfortunately.

Regarding RA, I have noticed that if console operators RDP directly to the RA server - rather than connecting via the RemoteApps URL - it seems to consume more memory. Is there a way to prevent users from doing this? The RA AD user group is a member of both “Remote Desktop Users” and “IIS_IUSRS” on the local RA server.

Hello!

This technet post covers a potential solution: https://social.technet.microsoft.com/Forums/office/en-US/5d17f131-c6d1-49dd-b0b7-83c03c3fedbb/how-to-disable-remote-desktop-access-but-allow-remoteapps-to-run?forum=winserverTS

As there is no native solution, you can do the following:

  • Go to the RDP properties in the Terminal Services Configuration Console.
  • On the Environment tab, select the option “Start the following program when the user….”:

Path: c:\windows\system32\logoff.exe

Start in: c:\windows\system32

That’s interesting @strawgate, however the Terminal Services Configuration Console was removed in Windows Server 2012 (which I am running). It looks like the exact same thing can be done via Group Policy on WS2012, though.

I’m worried that I will snooker myself here… this server is in a datacentre so the only way I can access it myself is via RDP. By enabling the logoff for any connection, there’s no way to exempt myself, is there?! :slight_smile: