General setup question - Roots vs Relays

Hi,

Looking for advice on how to structure our BigFix setup as far as Root Servers and Relays go. Our main goal is patching servers running various OSes.

  • We have somewhere around 5,000-7,000 servers that we want to manage. They are a mix of Windows, Linux, and AIX.
  • We have two physical data centers (Florida and Colorado). One is Prod and one is Non-Prod.
  • We have servers in various AWS locations (US East/West, Europe, Asia Pacific East, South, Southeast)
  • We have a small number of servers in Azure

We want the system to be fault tolerant in case a server crashes or a data center dies. I went through the BF 101 training and they mentioned it could be installed as a Windows Failover Cluster for High Availability, but we didn’t go into any details. Would we want to have two BF Root servers in one of the data centers and then have Relays in other locations? Would we want two Root servers in Colorado and two in Florida, as part of a cluster, in case a whole data center dies? Or maybe just one in each of those locations?

For Relays, would we want to consider relays in each data center and every AWS location? Or maybe just one for AWS US, EU, and APAC but not all the East/West/South options?

We don’t want to go overkill, but want things to be as efficient as possible, so just looking for any recommendations so we do it right the first time. :wink:

Thanks.

I am going to assume that in the BigFix 101 class they were speaking of a DSA (Distributed Server Architecture), where you have the MASTER BF Server and a failover one. In this case, the two servers should (logically) be placed in different datacenters. However, the network speed and latency between them must be taken into consideration.

As far as relays go, as a general rule it does not matter if your locations are in the CLOUD or remote offices. Relays should be considered in any area (local LAN) where the number of endpoints is over ~10-20 devices and the network link is small. There is NO (software) cost to create relays, and as long as you have a solid policy in place for endpoints to FIND the best relay and a possible failover relay, you should be good.

Your question:
For Relays, would we want to consider relays in each data center and every AWS location? Or maybe just one for AWS US, EU, and APAC but not all the East/West/South options?

It is hard to answer because the information about how many endpoints are in these locations is not posted. Again, in general, I’d follow the above rule on relay placement. Regardless of CLOUD provider, there is a “network cost” (speed, latency, etc.) and a monetary cost to move data around inside the cloud (which will vary per provider and your plan). But let’s say that on average you need to distribute 100MB of patches (low balled) per month, and you have a SINGLE relay in your AWS US environment. If you are servicing 1,000 endpoints in AWS/US and only 250 of those endpoints are close to the relay, then you are moving 100MB to ~750 endpoints (~73 GB) inside the AWS space, which may COST you in terms of both NETWORK and MONETARY cost.
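To make the arithmetic above easy to rerun with your own numbers, here is a minimal sketch. The function name and parameters are just for illustration, and it assumes 1 GB = 1024 MB, which is how 750 x 100MB comes out to roughly 73 GB.

```python
def intra_cloud_transfer_gb(endpoints, near_relay, payload_mb):
    """Data (in GB, 1 GB = 1024 MB) sent to endpoints that are NOT close to the relay."""
    far_endpoints = endpoints - near_relay
    return far_endpoints * payload_mb / 1024


# Numbers from the example: 1,000 endpoints, 250 near the relay, 100MB of patches.
monthly_gb = intra_cloud_transfer_gb(endpoints=1000, near_relay=250, payload_mb=100)
print(f"{monthly_gb:.1f} GB per month")  # prints "73.2 GB per month"
```

Plug in your real per-region endpoint counts and monthly patch volume to see whether an extra relay in a given region is worth it.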

I have a client that leverages relays in AWS and AZURE to service tens of thousands of endpoints on the internet. They FIND the closest relay on the AZURE cloud by location (US-East / US-West / etc.) as their primary relay, and fall back to the TOP level relay in AZURE if the closest one is unavailable.

I would also stress that as you set up your relays (no matter what you choose), you make sure the endpoints can logically and automatically connect to the best relay possible given the number of endpoints you have. I would do this with a POLICY action in BigFix. Since these endpoints (from your thread) are servers and not desktops/laptops, they should not be mobile, and setting them up with a MANUAL connection to their primary/secondary and failover relays is fine. If you choose to use the AUTOMATIC method of relay selection, ensure you set the appropriate Relay Affiliations on both the RELAYS and the endpoints.
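As a rough illustration of the "closest relay by location, top level relay as failover" idea, here is a small sketch of the decision logic only. It is not BigFix code or a BigFix API; the region names and hostnames are made up, and in practice you would express the same mapping in your policy action.

```python
# Hypothetical relay inventory (region -> relay hostname). These are not real hosts.
REGIONAL_RELAYS = {
    "us-east": "relay-use.example.com",
    "us-west": "relay-usw.example.com",
    "eu": "relay-eu.example.com",
}
TOP_LEVEL_RELAY = "relay-top.example.com"  # hypothetical fallback relay


def relay_plan(endpoint_region):
    """Return (primary, failover) relay hostnames for an endpoint in a region.

    Endpoints in a region without its own relay use the top level relay as primary.
    """
    primary = REGIONAL_RELAYS.get(endpoint_region, TOP_LEVEL_RELAY)
    return primary, TOP_LEVEL_RELAY


print(relay_plan("us-east"))   # regional relay first, top level relay as failover
print(relay_plan("ap-south"))  # no regional relay defined, so top level for both
```

Keeping the mapping in one table like this makes it easy to see which regions have no local relay yet.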

I hope I didn’t confuse you with this (I am replying rather fast as I have a meeting to get to), but if you’d like, I’d be happy to talk with you about it: dan.powers@cdw.com

Your environment is very similar to ours…

My recommendations

  1. Build a test environment, learn, break it, rebuild from learnings. Your 3rd environment or so will be where you apply all the learnings.

  2. When HCL says you can do high availability using clusters… they never tell you how. It’s sort of left up to the customer because every customer has different solutions to do that, such as VMware keeping two VMs in sync across datacenters, the whole Windows Cluster thing, and other solutions. Our environment does not offer these capabilities for a few reasons, and we wanted full HCL support (especially being new to BigFix). So we went with DSA, as it’s the only officially supported solution by HCL. Some advisors will tell you it is going away, however we have been using it for over 3 years now and the story has not changed… DSA is not going away anytime soon, and HCL will give a transition period, I am sure.

  3. For DSA, you can spec two similar servers… one will act as the primary and replicate to the secondary. These are two fully operational servers. However, if you take an action on the secondary, nothing will happen until they sync… so expect a 10+ min delay. You then set your relays up with the master as primary and the DSA server as secondary so they fail over automatically… clients would never know the master server changed, as that’s handled at the relay level.

  4. As far as relays, you want as many as you need to allow for 1,000 - 5,000 clients max per relay… it really depends on how much workload you have to determine the true max in your environment. We stick to about 1,200 max… but that’s just luck of the draw; given how our clients are spread across environments and failover relays, it just ends up being that way. Don’t be too concerned here, just learn as you go. Relays are basically disposable and you can stand up and replace them as needed.

  5. Use DNS aliases as much as possible. It gives you flexibility to change servers out without changing configurations, etc. I also highly suggest a dedicated alias for the failover relay / relay list for your masthead.

  6. Go with secure TLS 1.2 right from the start; you have to enable it yourself.

  7. Backup and save everything with notes as you go through the process. Passwords, decryption keys, etc.

  8. We set up our environment with TLRs (Top Level Relays). I basically have the master > 4 TLRs > ~12 relays. If I had to do it again, with our environment being only 15k, I wouldn’t do TLRs… I think it’s overkill. Mainly because we found there is a drawback: if a relay is a child of another relay and it switches to a different parent… you have issues with the site caching and have to run a script (that I developed based on documentation from HCL) to “reset” the relay. It works when it works, but about once a year I have to do the reset when I make some configuration changes. I have plans to just break that down to only 1 level of relays at some point.

  9. Also 100% set up a test environment because you will want to test upgrades, and occasionally they break with DSA during upgrades. Usually an easy fix, but this gives you time to figure it out without the stress in Prod. We have 3. The first is a basic single-server Dev. Then a QA that “mirrors” production, with a DSA setup, a relay structure, etc.
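For the relay sizing in point 4, a quick back-of-the-envelope sketch can give a starting number. The function name and the single spare failover relay are my own assumptions for illustration; the 1,200 and 5,000 clients-per-relay figures are the ones mentioned above.

```python
import math


def relays_needed(endpoints, max_per_relay=1200, spare=1):
    """Relays needed to keep each one under max_per_relay, plus spare failover relays.

    spare=1 is an assumed safety margin, not an HCL recommendation.
    """
    return math.ceil(endpoints / max_per_relay) + spare


print(relays_needed(7000))        # 7 (6 relays at ~1,200 clients each, plus 1 spare)
print(relays_needed(7000, 5000))  # 3 (2 relays at up to 5,000 clients each, plus 1 spare)
```

This only gives a capacity floor. Network layout (datacenters, cloud regions) will usually push the real count higher than the raw math suggests.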
