Relay Overload Issue

We have a big environment where 30 thousand endpoints are reporting, but our mid-level relay is overloaded, and the clients point at it via manual relay selection.

I thought relay affiliation might help here, but we have so many subnets that it would be tough to create that many properties for affiliation correctly.

Is there any other option to handle this situation?

Regards,
Shaban

How distributed are you? For example, are the 30K endpoints all in the same building? Distribution makes a difference and is touched on here. Relay affiliation can sometimes be more effort than it is worth and 30K isn’t all that much. We used it when we had 120K endpoints but now with 15K, we don’t.

If you’re looking for an easy answer, deploy some more relays. They can go anywhere. We have some on Windows, some using TinyCore Linux, some with CentOS that are in the Azure Cloud. If you have a few virtual environments, spinning a few up or sharing the service on a Print/File server is almost a zero-cost option.

The easiest solution is to set up a new relay and divert traffic there based on your configuration.

Relay Affiliation is also only effective when you configure the clients for Automatic Relay Selection. It has no effect on Manual Relay Selection.

With Automatic selection, the client will use ICMP pings to determine which relays are closest. Affiliation helps provide “hints” to the client for which groups of relays to try first, in case there are both local relays and remote relays that appear to be the same number of hops away (for instance, when a site-to-site VPN is hiding the actual number of network hops between the client and relay).
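For reference, affiliation is driven by a pair of settings: a seek list on the client and an advertisement list on the relay, matched by group name. A minimal sketch (the group names here are illustrative, not from this environment):

```
# On the client - ordered list of affiliation groups to seek; "*" matches any relay:
_BESClient_Register_Affiliation_SeekList = EastCampus;HQ;*

# On each relay - the groups that relay advertises itself under:
_BESRelay_Register_Affiliation_AdvertisementList = EastCampus;HQ
```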

That means we should set automatic relay selection instead of manual.

And is there any possible way to find the nearest relay for each client, so that we could set those machines with manual relay selection? Just a second thought.

Regards,
Shaban

Manual selection can be a good idea when you do not have sufficient relays at a particular site/location. In that case, automatic relay selection can point all the endpoints at one relay, which can cause network bandwidth issues and prevent the clients from communicating with the relay properly.
Hence I would suggest going with manual relay selection and dividing the clients per relay based on hop count, or adding a couple of relays at that site before going for automatic relay selection.

We have plenty of relays available, but with manual relay selection, is there any option to find the nearest relay from the endpoints?

That would be the default when you switch to Automatic Selection. Affiliation is only needed when the client guesses the nearest relay incorrectly (such as the VPN case hiding network hops from the client).

We had a similar problem for a client. Very large, many many subnets, network changes were constant (add/change VLANs), and many endpoints (>30k). To make matters worse, a large portion of the endpoints would MOVE between campuses on a regular basis.

The customer was also very fearful about bandwidth in some locations, and lastly wanted to be able to group devices automatically (not all endpoints were in AD either… so we could not leverage that).

What we did find was that the network team of course had the subnet/VLAN information we wanted, plus live information from SolarWinds. So we asked for a dump in CSV format. What was interesting to me was that they gave us data that looked something like:

SUBNET, LOCATION, City, Address, LinkSpeedUp, LinkspeedDown

10.10.15.34, EastCampus, Boca FL, 11 yamotta Way, T1, T1
10.10.24.32, 3rd Floor Corp, Mirimar FL, 1234 tarce cir, Ethernet, 100 Gbps
10.45.56.76, Store 123, Jaxsonville FL, xxx 9 street, DSL, DSL

It was interesting to me because they had the “Subnet by Location” information (friendly names), link speeds, and other nuggets that would help us build groups of devices, while also providing the address, friendly name (no longer needing the Subnet by Location Wizard), and link speed that might help us understand throttling per location.

Now, how can we leverage this? (In the end this was all automated: the network folks dumped/updated a file for us.)

Well, I thought about how the relays.dat file works: basically, every time you update a relay or add/remove a relay, BigFix compresses the relays.dat file and sends it to all endpoints. So we wrote a script that took the network team’s subnet dump and merged it with our relay information:

SUBNET, SEEKLIST, FAILOVER, LOCATION, CITY, ADDRESS, LinkSpeedUp, LinkSpeedDown

10.10.15.34, EAST;FL1, TIER1;TIER2;TIER3, EastCampus, Tampa FL, 11 yamotta Way, T1, T1
10.10.24.32, CORP;FL1, TIER1;TIER2;TIER3, 3rd Floor Corp, Mirimar FL, 1234 tarce cir, Ethernet, 100 Gbps
10.45.56.76, WEST;FL1, TIER3;TIER1;TIER2, Store 123, Jaxsonville FL, xxx 9 street, DSL, DSL
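A rough sketch of what such a merge script could look like in Python (the mapping table, column names, and default values here are illustrative assumptions, not the actual script):

```python
# Join the network team's subnet dump with our relay affiliation data.
# All group names and values below are illustrative.

# LOCATION -> (affiliation seek list, failover relay list)
AFFILIATION_MAP = {
    "EastCampus":     ("EAST;FL1", "TIER1;TIER2;TIER3"),
    "3rd Floor Corp": ("CORP;FL1", "TIER1;TIER2;TIER3"),
    "Store 123":      ("WEST;FL1", "TIER3;TIER1;TIER2"),
}
DEFAULT = ("UNKNOWN", "TIER1;TIER2;TIER3")  # unmatched locations stand out

def merge(network_rows):
    """Take rows from the network team's CSV dump (as dicts, e.g. from
    csv.DictReader) and emit rows with our relay data plugged in."""
    merged = []
    for row in network_rows:
        seeklist, failover = AFFILIATION_MAP.get(row["LOCATION"], DEFAULT)
        merged.append({
            "SUBNET": row["SUBNET"],
            "SEEKLIST": seeklist,
            "FAILOVER": failover,
            "LOCATION": row["LOCATION"],
            "CITY": row["City"],
            "ADDRESS": row["Address"],
            "LinkSpeedUp": row["LinkSpeedUp"],
            "LinkSpeedDown": row["LinkspeedDown"],
        })
    return merged
```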

It was fairly easy, based on the names, to plug our relay groups (affiliations) into the network team’s data by location. Then we zipped up this file (being all text, thousands of lines become a few KB in a ZIP file) and added it to a custom site with -SendtoClients. Note that this zipped-up file was smaller than most of the other files in any site.

On the client side, we had a policy that looked for the date stamp of this file; if the file was new within, say, the last 15 minutes, the policy became active and did the following:

  1. Unzip the file
  2. Look up its subnet in the file
  3. If the settings for SeekList / Subnet by Location / failover list were not set, or DIFFERENT from the lookup:
    Set _BESClient_Register_Affiliation_SeekList
    Set _BESClient_RelaySelect_FailoverRelayList
    Set Location_By_Subnet
    Set City
    Set Uplink
    Set DownLink
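The lookup-and-compare in steps 2 and 3 can be sketched like this (illustrative Python, not the actual client-side logic; in the real deployment this ran as BigFix action script and relevance, and keying the table by CIDR strings is an assumption, since the dump listed bare addresses):

```python
import ipaddress

def desired_settings(client_ip, subnet_table):
    """Step 2: find the client's row in the merged file and derive the
    settings it should have. `subnet_table` maps CIDR strings (an
    assumption) to merged rows like those in the file above."""
    ip = ipaddress.ip_address(client_ip)
    for cidr, row in subnet_table.items():
        if ip in ipaddress.ip_network(cidr):
            return {
                "_BESClient_Register_Affiliation_SeekList": row["SEEKLIST"],
                "_BESClient_RelaySelect_FailoverRelayList": row["FAILOVER"],
                "Location_By_Subnet": row["LOCATION"],
            }
    # No match: flag the subnet as missing from the network team's file.
    return {"Location_By_Subnet": "UNKNOWN"}

def settings_to_change(current, desired):
    """Step 3: only touch settings that are unset or differ from the lookup."""
    return {k: v for k, v in desired.items() if current.get(k) != v}
```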

Also note, we put the common home IP ranges (like 192.168.1.x) into the file with the location “INTERNET”, and if no match was found we set the location to UNKNOWN; this helped us find subnets that were NOT in our file. During testing we did NOT enable automatic relay selection (all clients were in manual mode at the time) but waited to ensure the settings looked correct. We had another Fixlet that checked whether all these settings were correct and, if the client was in manual mode, became active to switch it to automatic. So we slow-rolled this out and it worked quite well. We even started to play with the link speed to adjust throttling.

As this was automated, it was rather cool to watch machines change locations (frequent travel between campuses) and watch the Subnet_By_Location and relay selection change dynamically. With a few changes you could also use this method to set MANUAL relay selection based on the same data.


Great methods, Dan, and I’d add a couple of notes for the next time you tackle a similar issue.

Site files are automatically compressed when you attach them to a site, so you may be able to “skip the zip”. In my experience the automatic compression is very similar to zip efficiency for text files. The file on the client will be uncompressed, but you can check the site files on a relay or root server to observe the sizes with compression.

I had a similar thing at my last employer, and rather than checking for an update every 15 minutes I’d compare the modification time of the site file to the effective dates of the client settings. This was very responsive to file updates, and had less overhead than executing an action periodically, because the action only ran after the file actually changed. (In the case of roaming clients I’d also check something like system boot time or time of network link-up, to ensure the mapping is updated when the system moves to a new site.)
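That freshness check can be sketched as follows (illustrative Python; in BigFix itself this would be a relevance expression comparing the file’s modification time to the settings’ effective dates, and all names here are assumptions):

```python
def mapping_needs_refresh(file_mtime, settings_applied, boot_time=None):
    """Re-run the mapping logic only when the site file is newer than the
    last time the client settings were applied. All values are Unix
    timestamps; file_mtime would come from os.path.getmtime (or the
    client's file inspectors). For roaming clients, a boot or network
    link-up after the settings were applied also forces a refresh.
    """
    if file_mtime > settings_applied:
        return True
    return boot_time is not None and boot_time > settings_applied
```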

It’s a sad commentary that there’s often so much useful data like this living in silos, often not available to the BigFix operators. The Network teams usually need to know where all the subnets are, to manage the firewall and proxy rules; the Active Directory team also needs this info, to set up Active Directory Sites for the clients and for cross-site replication; and the Bigfix administrators need to have the same information for efficient relay deployment and selection. Often these three groups, and others, resort to building the maps individually (and inconsistently), when a bit of cooperation could go a long way. Bravo for helping the customer leverage the data they already had!


That is good to know - I didn’t know it was compressed / uncompressed before transit. Much easier than dealing with the ZIP process…