Hop count to select relay is wrong

(imported topic written by Rolf.Wilhelm91)

Dear all,

I wondered why the new clients at a new location in Germany selected a BigFix Relay in the US and not the one in Switzerland. I ran some ping and traceroute checks and found that the best BigFix Relay has a higher hop count (see Code frame 1) than the one that was selected automatically.

Tracing route to <anonymized> over a maximum of 30 hops:

  1     2 ms    <1 ms    <1 ms  <anonymized>
  2     9 ms    24 ms    27 ms  <anonymized>
  3    <1 ms    <1 ms    <1 ms  <anonymized>
  4     1 ms     1 ms     1 ms  <anonymized>
  5     5 ms     5 ms     5 ms  <anonymized>
  6    16 ms    13 ms    16 ms  <anonymized>
  7    15 ms    15 ms    15 ms  <anonymized>
  8    14 ms    15 ms    14 ms  <anonymized>
  9    17 ms    16 ms    16 ms  <anonymized>
 10    16 ms    16 ms    18 ms  <anonymized>
 11    17 ms    17 ms    16 ms  <anonymized>
 12    16 ms    16 ms    16 ms  <anonymized>
 13    17 ms    17 ms    15 ms  <anonymized>
 14    25 ms    16 ms    16 ms  <anonymized>

Trace complete.

Example 2: much higher ping latency, but the hop count is lower:

Tracing route to <anonymized> over a maximum of 30 hops:

  1     1 ms    <1 ms    <1 ms  <anonymized>
  2    <1 ms    <1 ms     3 ms  <anonymized>
  3     1 ms     1 ms     1 ms  <anonymized>
  4     1 ms     1 ms     1 ms  <anonymized>
  5     5 ms     5 ms     5 ms  <anonymized>
  6   146 ms   152 ms   146 ms  <anonymized>
  7   161 ms   148 ms   146 ms  <anonymized>

Trace complete.

This also explains why clients in some countries are selecting the wrong Relay. We do not have, and do not want, a Relay component in each location. We have no control over the number of hops, because this is managed by the MPLS network from BT.

Any hints on how to make relay selection check ping latency instead of the number of hops?

Thanks,

Rolf.

(imported comment written by BenKus)

Hi Rolf,

Our relay selection algorithm makes the assumption that low hop counts have a strong correlation to high bandwidth. In the event that this is not true, you can manually set the BES Clients to point to a better relay that you know has higher bandwidth.
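To make the mismatch concrete, here is a minimal Python sketch that compares the two criteria using the numbers from Rolf's traces: the hop counts and the three round-trip times of the final hop. The `Relay` type and the relay labels are made up for illustration, and this is not the actual BES Client selection code, only a comparison of "fewest hops" against "lowest average latency".

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Relay:
    name: str      # hypothetical label, not a real hostname
    hops: int      # hop count reported by traceroute
    rtt_ms: list   # round-trip samples of the final hop, in milliseconds


# Figures taken from the two traces in the first post.
relays = [
    Relay("relay-trace-1", hops=14, rtt_ms=[25, 16, 16]),
    Relay("relay-trace-2", hops=7, rtt_ms=[161, 148, 146]),
]

# Selection by hop count versus selection by average latency.
by_hops = min(relays, key=lambda r: r.hops)
by_latency = min(relays, key=lambda r: mean(r.rtt_ms))

print("fewest hops:   ", by_hops.name)
print("lowest latency:", by_latency.name)
```

With these inputs the two criteria disagree: the 7-hop relay wins on hop count, while the 14-hop relay wins on latency.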

Starting in the next version of BES, we allow you to make “Relay Groups” that will let you solve this problem more easily by allowing you to include/exclude groups of BES Relays for agents.

Ben

(imported comment written by SystemAdmin)

Ben, I noticed the same issue. Has this feature been released yet?

I strongly believe that BigFix should do relay selection the way McAfee does: average the response time of 3 pings to the same host instead of relying on hop count. Hop count has been causing us major issues and headaches since we tried moving over to automatic relay selection.

McAfee logic

http://74.125.95.132/search?q=cache:zYKk8BXoKM0J:https://kc.mcafee.com/corporate/index%3Fpage%3Danswers%26type%3Dsearch%26question_box%3DKB55685+mcafee+closest+repositories&hl=en&ct=clnk&cd=1&gl=us

(imported comment written by BenKus)

Hi Nicky,

Yes. BES 7.0+ has the feature “Relay Affiliation”: http://support.bigfix.com/bes/install/besrelayaffiliation.html . The way this works is that the agent will only attempt to do relay selection for members of its relay affiliation group.

The article you sent from McAfee is interesting and I can see that it might work in many situations… and note that I am not an expert in their solution, but their method seems very flawed for many customer deployments that I have seen… Here is why:

  • They determine “Subnet Distance” by “The distance is the number of bits from the lowest bit, which differs between two IP addresses.” – To me, that is just an awful idea because companies don’t necessarily consider distances when they number their subnets… To illustrate, here is a simple scenario:

  • 10.0.1.0 subnet in New York with Relay

  • 10.0.2.0 subnet in London with Relay

  • 10.0.3.0 subnet in New Jersey no Relay

The agents in New Jersey probably have the highest bandwidth to NY, but the London relay would be considered “closer”, which is bad… I would hate to base our relay selection algorithm on arbitrary network conventions that differ with each company…
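The scenario above can be checked with a short Python sketch. It reads the quoted definition as "the position of the highest bit in which the two addresses differ", which is my interpretation of the article and not a confirmed description of McAfee's code. The host addresses inside each subnet are hypothetical.

```python
import ipaddress


def subnet_distance(a: str, b: str) -> int:
    """Position of the highest bit in which two IPv4 addresses differ (0 if equal).

    Assumed reading of the "Subnet Distance" definition, not McAfee's actual code.
    """
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return diff.bit_length()


agent = "10.0.3.10"  # hypothetical client in the New Jersey subnet
relays = {
    "New York": "10.0.1.5",  # hypothetical relay address in 10.0.1.0
    "London": "10.0.2.5",    # hypothetical relay address in 10.0.2.0
}

distances = {site: subnet_distance(agent, ip) for site, ip in relays.items()}
closest = min(distances, key=distances.get)

print(distances)  # {'New York': 10, 'London': 9}
print(closest)    # London
```

Under this reading, London scores 9 and New York scores 10, so the New Jersey agent would prefer London purely because of how the subnets happen to be numbered.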

  • Also, our relay selection appears to be much more dynamic than McAfee’s, based on what I understand from this article. So BigFix will find better relays faster, fail over faster, and recover faster because we run our relay selection more often (by default, every 6 hours the agents will check to see if there is a better relay)… This is important because ping latency can vary a lot in small periods of time… so if there is a temporary network condition that affects latency and the agents begin to change to different relays, that could be very bad too…

Again, I am not an expert on McAfee’s “repository selection” and I am sure it works well for many companies, but I don’t think it necessarily would help us do better relay selection…

Ben

(imported comment written by SystemAdmin)

Ben - I just realized that you misunderstood me because I didn’t clearly state which method I was referring to. When I posted that link (which has since changed to https://kc.mcafee.com/corporate/index?page=content&id=KB55685), I was speaking of the “Ping Time” selection method, not the “Subnet Value” one.

The hop count algorithm is causing us a lot of problems and a lot more administrative overhead than should be necessary. For example, we have some sites with different IP schemes, which results in a higher hop count to the local relay, so those clients go out over the WAN even though the LAN is much faster. We also have sites in the Eastern part of the world whose hop count to the Western part of the world is lower, so a site in the next country on a fast link is ignored because its hop count is higher. Hop count is a poor and inefficient way to choose relays. This really needs to be fixed. I see this as the biggest flaw in the product.

From the McAfee site:

ICMP ping using IP address

ICMP ping using DNS name

ICMP ping using NetBIOS name

The agent does not send any data to the server

The size of ICMP ping packet is 32 bytes

The agent calculates delay time by averaging response time of 3 ping attempts

The agent counts the delay time until the connection is established, then disconnects from the server

If the agent cannot connect to a server, that server is disabled in SiteMapList.xml.
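As a rough illustration of the ping-time steps listed above, here is a minimal Python sketch. It is not McAfee's or BigFix's code. The `probe` callable, the server names, and the canned timings are all hypothetical, and averaging only over the attempts that got a reply is my own assumption, since the article does not say how partial failures are handled. A real probe would have to send ICMP echo requests or time a connection; it is injected here so the selection logic can be shown without network access.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple


def pick_by_ping_time(
    servers: List[str],
    probe: Callable[[str], Optional[float]],
    attempts: int = 3,
) -> Tuple[Optional[str], Dict[str, float], List[str]]:
    """Return (best server, average delay in ms per server, disabled servers)."""
    averages: Dict[str, float] = {}
    disabled: List[str] = []
    for server in servers:
        samples = [probe(server) for _ in range(attempts)]
        replies = [s for s in samples if s is not None]
        if not replies:
            # No reply at all: treat the server as disabled.
            disabled.append(server)
            continue
        # Assumption: average only over the attempts that answered.
        averages[server] = mean(replies)
    best = min(averages, key=averages.get) if averages else None
    return best, averages, disabled


# Canned timings standing in for real pings (hypothetical values).
canned = {
    "relay-a": iter([25.0, 16.0, 16.0]),
    "relay-b": iter([161.0, 148.0, 146.0]),
    "relay-c": iter([None, None, None]),  # never answers
}


def fake_probe(server: str) -> Optional[float]:
    return next(canned[server])


best, averages, disabled = pick_by_ping_time(list(canned), fake_probe)
print(best, disabled)
```

Here `relay-a` is chosen because it has the lowest average delay, and `relay-c` ends up in the disabled list because none of its three attempts got a reply.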

Subnet value method: The ePO agent uses IP addresses to sort the repository servers according to the distance of each server from the host computer. The distance is the number of bits from the lowest bit, which differs between two IP addresses.

An example of IP address distance calculation:

Agent Computer:10.0.1.1

Localhost: 10.0.1.1 = 0

Repository A: 10.0.1.5 = 3

Repository B: 10.0.2.1 = 10

http://www.nai.com/: 63.215.198.30 = 30

If the agent cannot connect to a server, the server is disabled in SiteMapList.xml.

User defined list method: ePO Agent sorts the servers according to the configured order. The Fallback site is always the last repository server entered.

Kevin