Clients not connecting to relay during relay selection

I am working with some people on a relay selection issue they are having: some systems just do not seem to connect to their proper relay.

We are using relay affiliation which has been working quite well for the majority of systems, but there is a subset that just does not want to connect. I do not believe that it is an issue on the relay as other systems are connecting, so it has to be either something local on the systems, or something between the relay and the client.

I have asked them to ping both the short name and the FQDN in both directions (client to relay and relay to client), and this seems to be working fine. The firewall on the system is disabled, and the network group has stated that they are seeing the traffic.
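For anyone wanting to repeat the name check on many systems, here is a minimal sketch in Python of comparing what the short name and the FQDN resolve to. The host names in the usage note are hypothetical placeholders, and the helper names `resolve` and `names_agree` are my own, not part of any BigFix tooling.

```python
import socket


def resolve(name, resolver=socket.getaddrinfo):
    """Return the sorted list of IP addresses a name resolves to, or [] on failure."""
    try:
        infos = resolver(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def names_agree(short_name, fqdn, resolver=socket.getaddrinfo):
    """True only if both names resolve and they resolve to the same address set."""
    short_ips = resolve(short_name, resolver)
    fqdn_ips = resolve(fqdn, resolver)
    return bool(short_ips) and short_ips == fqdn_ips
```

Running `names_agree("relay01", "relay01.example.com")` on an affected client and again on a working one would show whether the two names point at the same address on both. This only checks name resolution, so it complements ping rather than replacing it.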

From what I can tell in the client logs, it seems to be selecting the right relay from the relays.dat, but just fails to connect and then goes to the failover server.

There is a PMR on this which has been open for a while, but we do not seem to be getting anywhere, so I thought I would ask the forum :smile:

Thanks

Martin

Investigating the last few systems that don’t want to cooperate with relay selection is never easy. Your assessment is most likely correct, but here’s the best way to try and pin it down, most of which it sounds like you’ve done:

  • Enable debug logging to 10000
  • Deploy a Force Automatic Relay Selection task
  • Gather Client Diagnostics with the Relay Selector test (if the number of relays is 250 or less)

The debug log will show you how relay selection appears to the client (which relays it is sending pings to, whether it is getting responses, etc.). The diagnostics will show you whether relays can be reached by ICMP and/or TCP from the Windows perspective. These should align; if they do not, it would suggest a bug in the agent code.

Assuming they do align and the pings are not getting a response, the next step would be to look at the network layer. That usually requires a Wireshark capture on both the client and the relay to confirm whether the requests are actually making it out to the network and being received by the relay system. I would expect you will find the packets are being lost here (e.g. you can see them leaving the client, and maybe on some intermediate routers, but not arriving at the relay). That would mean there is some issue within the network infrastructure between the two systems, or in the NIC/network driver layer of the relay system.
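Alongside a capture, a quick TCP connect test from an affected client can show whether a plain connection to the relay succeeds at all. This is only a sketch: the function name `tcp_reachable` is my own, the host name in the usage note is hypothetical, and the port is an assumption (52311 is the usual default, so substitute whatever your deployment uses).

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `tcp_reachable("relay01.example.com", 52311)` returning False on a client that can ping the relay would point at something filtering or dropping TCP specifically, which is worth knowing before reading through the capture.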

How many clients in total are connected to the relays that these clients are failing to connect to?

It could be that the maximum clients per relay threshold is being hit for that relay, which would be why they are getting bounced.

It is also possible that the limit on open / half-open TCP connections is being hit in the OS of the relay itself. This may need to be adjusted.
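One rough way to get a feel for the connection load is to save the output of `netstat -an` on the relay and count connections per TCP state on the relay port. The sketch below assumes the usual layout where each TCP line starts with the protocol, the local address is the first address-like field, and the state is the last field; the function name `count_tcp_states` and the port number are my own assumptions.

```python
from collections import Counter


def count_tcp_states(netstat_text, port):
    """Count TCP connections per state whose local address ends with :port."""
    counts = Counter()
    suffix = f":{port}"
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and non-TCP entries such as UDP.
        if len(fields) < 4 or not fields[0].lower().startswith("tcp"):
            continue
        # The first field containing a colon is taken to be the local address.
        addresses = [f for f in fields[1:] if ":" in f]
        if not addresses or not addresses[0].endswith(suffix):
            continue
        # The state (ESTABLISHED, SYN_RECEIVED, ...) is assumed to be the last field.
        counts[fields[-1].upper()] += 1
    return counts
```

A large count of half-open states (for example SYN_RECEIVED on Windows or SYN_RECV on Linux) relative to ESTABLISHED would be a hint worth following up. It does not say what the OS limit actually is, so treat it as a pointer rather than a diagnosis.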