Relay Selection balancing

I’m sure others have experienced this, and I’m hoping that someone might be able to give me a little advice.

I’m upgrading/replacing my public-facing relays with Server 2022 relays, and after we swapped out the first relay we noticed that we aren’t getting any semblance of balancing between our two public-facing relays. During the weekend we had 3300 on one public-facing relay and 200 on the other. During the work week this week we are seeing around 1300 on one and 70-90 on the second. I’ve seen mixed answers in the forum on whether selection uses response time or randomness to decide which relay gets the registration when both are at the same hop count, but I don’t believe we have ever had this much variance between the two relays in the past.

Both are set to the same relay affiliation group and have the same weight in relays.dat.

The primary consideration is how many hops there are between the client and the server. You can report on this with relevance such as distance of selected server.

Devices will identify the distance away from all relays in their affiliation group and place them into groups based on distance. It will pick the closest group and then do a weighted random selection (using weight from relays.dat) within that group.
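To make the description above concrete, here is a minimal Python sketch of "closest group, then weighted random pick". It is not the client's actual code. The function name pick_relay and the tuple layout are made up for illustration, and it assumes a larger weight means a larger share of picks, which may not match how the client really interprets the weight in relays.dat.

```python
import random
from collections import defaultdict

def pick_relay(relays, rng=random):
    """Pick one relay from a list of (name, hop_count, weight) tuples.

    Hypothetical helper: groups relays by hop count, keeps only the
    closest group, then does a weighted random choice inside it.
    Assumption: a larger weight means a larger share of selections.
    """
    groups = defaultdict(list)
    for name, hops, weight in relays:
        groups[hops].append((name, weight))

    # Only relays in the lowest-hop group compete with each other.
    closest = groups[min(groups)]
    names = [name for name, _ in closest]
    weights = [weight for _, weight in closest]
    return rng.choices(names, weights=weights, k=1)[0]
```

With two relays at the same hop count and the same weight, this sketch lands close to a 50/50 split over many clients. A relay that is even one hop farther never gets picked, which is why a single extra hop on the new server could explain a lopsided split like the one described.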

So I would recommend making sure that you haven’t replaced it with a server that is an extra hop or two away from the clients.

The BigFix clients use a sophisticated algorithm to calculate which relay is the closest on the network. The algorithm uses small ICMP packets with varying TTLs to discover and assign the most optimal relay.

If multiple optimal relays are found, the algorithm automatically balances the load. If a relay goes down, the clients perform an auto-failover. This represents a major improvement over manually specifying and optimizing relays. However, there are a few important notes about automatic relay selection:

ICMP must be open between the client and the relay. If the client cannot send ICMP messages to the relays, it is unable to find the optimal relay (in this case it uses the failover relay if specified or picks a random relay).
Sometimes fewer network hops are not a good indication of higher bandwidth. In these cases, relay auto-selection might not work correctly. For example, a datacenter might have a relay on the same high-speed LAN as the clients, but a relay in a remote office with a slow WAN link is fewer hops away. In a case like this, manually assign the clients to the appropriate optimal relays.
Relays use the DNS name that the operating system reports. This name must be resolvable by all clients otherwise they will not find the relay. This DNS name can be overridden with an IP address or different name using a task in the Support site.
Clients can report the distance to their corresponding relays. This information is valuable and should be monitored for changes. Computers that abruptly go from one hop to five, for example, might indicate a problem with their relays.
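As a rough illustration of the last note, the following Python sketch compares two snapshots of reported distances and flags computers whose hop count jumped. The function name, the dict layout, and the threshold of 2 hops are assumptions for the example; how you export the distance values is up to you.

```python
def flag_distance_jumps(previous, current, threshold=2):
    """Return computers whose hop distance grew by at least `threshold`.

    previous, current: dicts mapping computer name -> reported hop distance.
    Hypothetical helper; the threshold value is an arbitrary assumption.
    """
    flagged = {}
    for computer, new_distance in current.items():
        old_distance = previous.get(computer)
        # Skip computers with no earlier reading to compare against.
        if old_distance is not None and new_distance - old_distance >= threshold:
            flagged[computer] = (old_distance, new_distance)
    return flagged
```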


I turned on debug logging on the client and ran a relay selection; both relays appear to have exactly the same hop count from the client. It would make perfect sense to me if one were closer than the other, but I’m hoping there is a way to balance the two relays when the clients are the same hop count away.

You’ll want to check all of your clients, not just one. The easiest way to do this is to create a client property that tracks distance to relay and compare it in aggregate.

You can also use competition size of selected server to see how many relays were in play (the number of relays in the lowest-hop-count group) when the client performed selection. If the result is 1, the servers are not at the same hop count for that client.

If, after creating the property, you see the exact same hop count across all clients and competition size is greater than 1, then I’d recommend filing a support ticket with HCL.
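If you export those properties to a list, a small Python sketch like this one can do the aggregate comparison. The row layout (computer, selected relay, distance, competition size) and the function name are assumptions for illustration, not a BigFix export format.

```python
from collections import Counter

def summarize(rows):
    """Summarize relay selection results across many clients.

    rows: iterable of (computer, selected_relay, distance, competition_size).
    Returns three Counters:
      per_relay  - clients registered to each relay
      distances  - clients per (relay, distance) pair
      contested  - clients per relay whose competition size was > 1
    """
    per_relay = Counter()
    distances = Counter()
    contested = Counter()
    for _computer, relay, distance, competition in rows:
        per_relay[relay] += 1
        distances[(relay, distance)] += 1
        if competition > 1:
            contested[relay] += 1
    return per_relay, distances, contested
```

If most clients on the busy relay show a competition size of 1, the two relays were never competing for those clients, which points at a hop count difference rather than at the balancing itself.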

How do your clients find the public facing relay?

Are these external host names that get advertised, or a failover relay list?

I have two relays at the same hop count (using an affiliation list), and they are not being balanced the way my others are.

I ran the BES client diag tool expecting it to show me hop count and seeklist evaluations, but it only showed hop count. This seems different than in years past. Was the evaluation of seeklists removed from the client diag tool?