I’m currently going through an exercise to re-evaluate our relay infrastructure in light of some data centre moves and a shift in where endpoints now sit compared to when our infra was built.
It’s been a seriously interesting exercise, but it’s apparent that the BigFix worlds we all built maybe five or more years ago are now in a very different place in terms of requirements.
Windows patches have increased in size, RHEL patches have increased in both size and quantity, and of course we’re now patching things like middleware, apps and even more besides.
So, I’m curious: based on today’s patching needs for Windows and Unix-based OSes, what are you all doing in relation to specs for your Primary (Top Level) Relay servers? Do you feel they still meet your needs, or have you had to increase their specs significantly over the past couple of years?
Assuming you want to cache the OS patches - how are you accounting for that now too?
I know there’s a sizing guide out there, but this is more a general community feel and discussion I’m after than a set-in-stone guide.
Hi John, we’re doing a similar exercise here, as a significant reduction in on-prem servers has meant an increase in connections to remote failover/regional relays. Using workstations as relays isn’t reliable due to evening shutdowns, power saving, etc., and just the general coverage needed for the vast number of address spaces we have. We are looking at leveraging PeerNest as the workhorse for file distribution, which should reduce the need for clients to pull content from a relay, cutting the time clients spend consuming a relay connection, with the patch binaries pulled across the local LAN…so hopefully a win-win.
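In case it helps anyone weighing the same move, here’s a minimal sketch of enabling PeerNest via a policy action. The setting names are the ones I understand HCL documents for PeerNest; verify them (and the defaults) against the configuration-settings list for your client version:

```
// Hedged sketch: enable PeerNest so clients on the same subnet can share
// downloaded payloads instead of each pulling from the relay.
setting "_BESClient_PeerNest_Enabled"="1" on "{parameter "action issue date" of action}" for client
// Optionally keep a given machine from taking the active downloader role
// (it can still share what it already has cached):
// setting "_BESClient_PeerNest_IsPassive"="1" on "{parameter "action issue date" of action}" for client
```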
Our top-tier relays exceed 5k devices each. That’s not concurrent connections, but we easily see 20k devices reporting through the same relay, covering the entire timezone spread. We also have to limit distance for auto selection to 3 hops, because connections to very remote sites that may be 4 or 5 hops away are on expensive links, and there’s no way we want office devices hitting a relay at the end of a VSAT link just because it had fewer hops than a relay in an office two blocks away with a higher hop count.
Our relays are an area of very active investment right now. If you are looking at making changes or adapting to new scenarios, feel free to contact me directly; we can talk about some of the directions we are taking and how they may benefit you.
We’re currently redoing our relays as well. In order of motivating priority:
1. Replace CentOS relays with Windows, due to internal baseball.
2. Prepare for Enhanced Security, on the way to BigFix 11.
2.1 Provide a re-registration path for expired endpoints. Probably a failback scenario where the very last relay is non-authenticating, requiring on-prem/VPN networking?
3. Re-architect around home/vpn/on-prem IP spaces, replacing legacy “is/not laptop” logic.
3.1 Maybe put a relay in AWS?
3.2 Maybe use _BESClient_Download_Direct as well? (see the sketch after this list)
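On 3.2, here’s a minimal sketch of what that policy action could look like, assuming _BESClient_Download_Direct works as documented (the client fetches payloads straight from the source URL instead of through its relay; double-check behavior for your version):

```
// Hedged sketch: allow clients to download action payloads directly from
// the internet source rather than through the relay chain. Potentially
// useful for home/VPN endpoints; verify against HCL's settings reference.
setting "_BESClient_Download_Direct"="1" on "{parameter "action issue date" of action}" for client
```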
Looking at the Relay Health dashboard, I generally like to see cache cycling of about two months. I don’t have a hard reason for that, it just kinda feels right.
Our environment is under 15k servers split between 2 independent BigFix environments and a limited number of datacenters. We plan to merge, but it’s always put off as a future TODO every quarter, and has been for years now.
Anyway…
I started out using TLRs because I liked that all clients from the TLR down can stay connected together, and I would only need to make a change at the TLR should I have a failover/issue on the master (this is vague and not perfectly explained, but hopefully you get the gist). Then, over the years of supporting the product (2019+), I found a big negative with TLRs: when moving a child relay from one TLR to another, it was not uncommon for BigFix sites to have errors syncing. The only fix is to reset the relay gather state (https://support.hcltechsw.com/csm?id=kb_article&sysparm_article=KB0079078), which I wrote a script for. Combined with the fact that we just aren’t big enough client- or relay-wise and have a low number of deployments/apps/integrations, I have begun to break up my TLRs and just have relays reporting directly to the master. We have limited relays.
I allocated around 300-500 GB of free space on my relays, and it’s been more than enough. If I had to focus on costs, I would consider shrinking to 200 GB per relay; the relay will just re-download from the master if required. But the number here really depends on how many OS platforms you support, the number and size of packages being deployed, etc.
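If you want to enforce a cap rather than just size the disk, here’s a minimal sketch as a policy action, assuming _BESGather_Download_CacheLimitMB is still the relay/server download cache limit (verify the name and default against HCL’s configuration-settings reference for your version):

```
// Hedged sketch: cap the relay's download cache at roughly 200 GB.
// The value is in MB; the setting name is my recollection of the
// documented relay/server cache limit, so confirm before relying on it.
setting "_BESGather_Download_CacheLimitMB"="204800" on "{parameter "action issue date" of action}" for client
```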
I use automatic relay selection, except in situations like the DMZ or where there are “suboptimal network configs”, and I have rules to prevent clients from going to the master servers. I also use a DNS alias for many of my relays, but not all.
I also have a fallback relay set in the masthead, along with “_BESClient_RelaySelect_FailoverRelayList” as a policy action, giving it a list of the “TLR” relays.
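For anyone replicating this, a minimal sketch of that policy action; the hostnames are placeholders, and my understanding is the setting takes a semicolon-separated list:

```
// Hedged sketch: hand clients an ordered failover relay list.
// Hostnames are hypothetical examples, not real infrastructure.
setting "_BESClient_RelaySelect_FailoverRelayList"="tlr1.example.com;tlr2.example.com" on "{parameter "action issue date" of action}" for client
```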
Overall I have been really happy with this setup, and other than ripping out the TLRs that really don’t benefit me, I don’t have plans to change much beyond setting up Relay Affiliation groups some day (right now it’s effectively one big group, where I just don’t advertise a few relays/master/DSA for automatic relay selection).
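When I do get to affiliation, it should just be the two documented settings; a minimal sketch with hypothetical group names, using _BESClient_Register_Affiliation_SeekList on clients and _BESRelay_Register_Affiliation_AdvertisementList on relays:

```
// Hedged sketch: clients seek relays advertising "HQ" first, then fall
// back to any unaffiliated relay ("*"). Group names are placeholders.
setting "_BESClient_Register_Affiliation_SeekList"="HQ;*" on "{parameter "action issue date" of action}" for client
// And on the relays that should serve that group:
// setting "_BESRelay_Register_Affiliation_AdvertisementList"="HQ" on "{parameter "action issue date" of action}" for client
```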
We’ve also talked briefly about the feature that removes relays and lets clients get content from each other, but with our environment I just prefer using relays, for various reasons (it lets us monitor/control traffic, keeps disk space free on clients, minimizes impact on other servers/clients, etc.).