Delayed relay communication

lltc · August 26, 2020, 11:58pm

I have two Bigfix environments.

Dedicated - The main server, relays, and clients are all on the same trusted network with no firewalls or NATs. When a new action is started, all clients respond quickly.
Shared - The main server and shared relay are in DMZ. The customer relay is in their own network, and clients are on the same network as the relay.
Clients can communicate with their relay on TCP/UDP, but their relay is the only machine that can communicate with the shared relay in the DMZ.

If I create an action that applies to the customer, it can take quite a while to reach the client.
My understanding is that relays only communicate over TCP anyway, so UDP doesn’t come into play here? So why does relay-to-relay communication take so long in the Shared environment?

The shared relay cannot initiate to the customer relay. If it could, it would have to be via NAT. But this isn’t configured.
The customer relay reaches the shared relay using TCP 52311.
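
For anyone reproducing this setup, here is a minimal sketch of how the customer-relay-to-shared-relay path on TCP 52311 could be checked from the customer relay. The function name `tcp_port_open` is made up for illustration. It only confirms that a TCP connection can be opened; it does not speak the BigFix protocol.

```python
import socket


def tcp_port_open(host: str, port: int = 52311, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only tests reachability of the port and does not
    exchange any BigFix traffic.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `tcp_port_open("shared-relay.example.com")` from the customer relay (the hostname is a placeholder) should return `True` if the outbound path is open. Running the same check in the other direction would be expected to return `False` here, since the shared relay cannot initiate connections to the customer relay.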

Everything has command polling enabled for 15 mins.

If I restart the BES Relay service, it checks in immediately and downloads the files and starts the action.

Does the shared relay need to initiate a push to the customer relay?
Otherwise, what is needed to get that action to take place quicker?

Jared · August 27, 2020, 9:16pm

Out of curiosity, how long is the evaluation interval? If it is longer than 15 minutes, that might be your issue.

Aram · August 27, 2020, 9:29pm

You are correct that Relay-to-Relay communication is done via TCP, but as you surmise, by default the upstream Relay will try to send notifications to downstream Relays when there are new actions/sites available (still via TCP). With NAT (or other similar network configurations), these downstream TCP notifications can be blocked.

That said, have a look please at the following as it may help here: https://help.hcltechsw.com/bigfix/9.5/platform/Platform/Config/c_persistenconn2.html (note that this requires v9.5.13+).

lltc · August 28, 2020, 12:47am

Thanks @Aram, I will take a look at this in more detail.
However, will this work in my setup, as it’s the opposite way around?

i.e., the Parent relay is in the DMZ, and the child relay is in the trusted zone.
The Parent relay will not be able to establish the connection to the child relay at all, at least not without a DNAT to the child.

Do NATs work to the child, since the real IP is reported from the client?

Unless you are referring to Persistent Connection rather than DMZ relay?
https://help.hcltechsw.com/bigfix/9.5/platform/Platform/Config/c_persistenconn.html

I will check this out as well.

lltc · August 28, 2020, 2:13am

It looks like neither of these will work because the persistent connection is established by the Parent relay.
And the client on the child relay connects to itself as localhost (is this correct?).

I need relay to relay persistence from child to parent.

I’ve added _BESRelay_GatherMirror_UpstreamCheckPeriodMinutes but it doesn’t seem to have made a difference.

I don’t really understand why this would work, but I guess this is something I can try by setting up a DNAT to the child relay.
Child Relay notification with NAT

JasonWalker · August 30, 2020, 3:31pm

Ok, I cringe at some of what I wrote in that linked topic three years ago, and I’ve learned a lot more about how the relays work since then.

When there is new content, the Root Server indeed notifies its child relays, and parent relays notify their child relays, via a TCP connect from the top-down.

At the leaf relays (relays that have clients), a UDP message is sent from the relay to clients to notify them of new content.

Where the UDP, or top-down TCP, is blocked, the client or child relay is not informed of new content. In that case, it waits for either a new Relay Select, Gather Interval, or Command Poll to check upstream for new content.

Where the Relay is not being notified, a client may command-poll in to that Relay, and the Relay can either check upstream or report status based on what it has already gathered. I think your most likely problem is _BESRelay_GatherMirror_UpstreamCheckPeriodMinutes, documented at https://help.hcltechsw.com/bigfix/9.5/platform/Platform/Config/r_client_set.html#r_client_set__cong

By default, if the Relay is not receiving the notifications from its own parent, it will only go check every 6 hours for new content. (Also, that check is only performed when a client requests it, so it’s not quite the same as a “Relay Command Poll”.)

I think if you tune that value down on the child relay, to be more in line with your Command Poll interval, you should get faster responses on actions. Turning it down too much can add workload on the parent relay structure, so be careful in changing too many relays at once and measure performance impacts as you go; but if you are using dedicated relays you should be fine turning this down to a half hour or so.
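
To put rough numbers on that, here is a back-of-the-envelope sketch. It is a simplified model based only on the behavior described above, not documented BigFix internals. It assumes a single client command-polling at a fixed interval, a relay that never receives a notification from its parent, and an action created right after the relay's last upstream check. The function name is made up for illustration.

```python
import math


def worst_case_relay_delay_minutes(upstream_check_period: int,
                                   command_poll_interval: int) -> int:
    """Rough upper bound (minutes) before an un-notified relay re-checks upstream.

    Simplified model (an assumption, not product documentation): the relay only
    checks its parent when a client polls AND at least `upstream_check_period`
    minutes have passed since its last upstream check.
    """
    if upstream_check_period <= 0 or command_poll_interval <= 0:
        raise ValueError("both intervals must be positive")
    # Number of client polls that must pass before the check period has elapsed.
    polls_needed = math.ceil(upstream_check_period / command_poll_interval)
    return polls_needed * command_poll_interval


# Default 6-hour upstream check period with a 15-minute command poll.
print(worst_case_relay_delay_minutes(360, 15))
# Upstream check period tuned down to a half hour.
print(worst_case_relay_delay_minutes(30, 15))
```

Under these assumptions the first call gives 360 minutes and the second gives 30, which is the kind of difference you would expect to see after tuning the setting on the child relay. The client may need one more poll after that to actually pick up the action.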

You might also consider getting notifications to the customer Relay to work instead. I don’t much recall the details on that thread you referenced, but at the time, at least, I was pretty sure that if we used a Static NAT rather than a Hide NAT on the client relay, the parent relay would register its NAT’d address and downward notifications should work.

If a static NAT alone isn’t sufficient, you could consider the “DMZ Relay” configuration that @aram references. In your case, your true DMZ relay would take the role of the internal parent relay, and your customer relay (behind the NAT) would take the role of the DMZ relay. You would need a static NAT on the customer-site relay, and you would configure your real DMZ relay to act as a parent to it. This is the configuration Aram references at https://help.hcltechsw.com/bigfix/9.5/platform/Platform/Config/c_persistenconn2.html, where you configure your Parent Relay (your real DMZ relay, in your case) to initiate connections to the DMZ relay (Customer-site Relay, in your case). The intent for this config was where the DMZ cannot initiate any inbound communication, at all, to an Internal relay, but I think it should also work for your scenario.

djrobin · September 2, 2020, 3:02pm

Are the Clients set to Manual or Auto Relay Selection? I had similar issues, and after I set them to Auto, performance improved a great deal, as that will utilise UDP.