New content takes an age to get to relays

cscott · September 12, 2024, 2:33pm

We've had BigFix for 3 or so years now and this issue has plagued us since we implemented it.

Basically, when we deploy new content, it seems to take a very long time to get to the endpoints. Once cached on relays, it’s fine, but getting the content to the relays seems to take too long.

On offers - end users see “pending start” sometimes for hours, waiting on the content being copied down the chain of relays until it gets to their relay. In the client logs of course we see “downloadsavailable : false” until the content is available.

Seems to be an issue for both internal and external clients, perhaps even worse for internal clients since they may have 3 or 4 relays above them in the chain up to the root. Our external relays are top level and right next to root so there should be almost no delay for small content, but it can still take a long time (over an hour for tiny content).

I have of course raised this over and over with support and our HCL resources but still never seem to get to the root cause.

Has anyone else come across this?

JasonWalker · September 12, 2024, 2:57pm

Very likely your downward communication is blocked. The Root notifies its Relays, and each Relay notifies its child Relays, of new content / available downloads / etc. by opening connections downstream on TCP/52311.

Once we are at the leaf relays, the Relay notifies its clients about new content/downloads/etc. via UDP/52311.

In your case it sounds like the downstream UDP/52311 is probably working (once the download is available on the leaf relays the clients pick it up quickly) but probably your downstream TCP/52311 is blocked (relays are not able to notify child relays about new actions/downloads/fixlets/etc.)
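One quick way to test the downstream TCP path is a plain socket connect, run on the parent relay (or root) and pointed at each child relay. This is only a minimal sketch: it shows that a TCP connection to port 52311 can be opened in that direction, not that the relay notification itself is delivered. The function name is made up for illustration.

```python
import socket


def tcp_port_open(host: str, port: int = 52311, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Run this FROM the parent relay TOWARD the child relay, because the
    direction of the connection is what matters for downstream notifications.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `tcp_port_open("child-relay.example.com")` (a placeholder hostname) returning `False` from the parent, while the same check succeeds from child to parent, would point at one-way filtering or a NAT.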

If that’s the case you’ll likely also see timeouts with RelayNotifier messages in the BESRelay.log of the parent relays. Check for those messages before we go much further.
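A simple way to look for those messages is to filter the relay log for lines mentioning RelayNotifier. The sketch below assumes nothing about the exact message format; it just does a case-insensitive substring match. The log location varies by platform and install, so the path is left as a parameter, and both function names are hypothetical.

```python
from pathlib import Path


def find_notifier_lines(log_text: str, keyword: str = "RelayNotifier") -> list[str]:
    """Return every line of `log_text` that mentions `keyword` (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in log_text.splitlines() if needle in line.lower()]


def scan_log_file(path: str, keyword: str = "RelayNotifier") -> list[str]:
    """Read the log at `path` (supplied by the caller) and filter it for `keyword`."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return find_notifier_lines(text, keyword)
```

Pass the full path to BESRelay.log on the parent relay to `scan_log_file` and read through the matching lines for timeouts or connection failures.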

The best course is to enable the downward TCP traffic. However if there are NATs or you’re otherwise unable to open the traffic there are some other things we can try related to Command Polling, DMZ Relay Configuration, and Persistent Connections.

cscott · September 17, 2024, 12:16pm

Hi Jason, if the downward comms is not working, then surely the content would never get there?

Will check for those errors in the BESRelay logs and see if that gives us something.

JasonWalker · September 17, 2024, 12:30pm

The downward comms only notify the child relays that there is new content, so that they check for it immediately.
Even with no downward comms, the child relays will check their parent for all sites at the gather interval (every six hours by default I think), or more frequently if Command Polling is configured.

JasonWalker · September 17, 2024, 12:34pm

The “Pending Downloads” check should be more frequent once a Relay has requested the download. I could go find a reference on this but my recollection is the relay will check upstream for the download every ten minutes. The parent relay will attempt to notify the child relays as soon as the download is available and short-circuit that time, if the downward message gets through. Otherwise the child relays waiting up to ten minutes on each download (even after the parent has cached it) can add up with multiple downloads or multiple relays in the chain.
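To see how these waits could add up, here is a rough back-of-the-envelope calculation. It assumes the figures recalled in this thread (a ten-minute pending-download check and a six-hour gather interval, neither confirmed against documentation) and the simplest possible model, where each relay in the chain waits up to one full interval before fetching from its parent. Transfer time and any throttling are ignored.

```python
def worst_case_delay_minutes(relay_hops: int, interval_minutes: float) -> float:
    """Upper bound on added delay if every relay hop waits one full polling interval."""
    if relay_hops < 0 or interval_minutes < 0:
        raise ValueError("relay_hops and interval_minutes must be non-negative")
    return relay_hops * interval_minutes


# Assumed intervals, taken from the recollections in this thread.
PENDING_DOWNLOAD_CHECK_MIN = 10   # relay re-checks upstream for a requested download
GATHER_INTERVAL_MIN = 6 * 60      # relay gathers sites from its parent

for hops in (1, 2, 4):
    pending = worst_case_delay_minutes(hops, PENDING_DOWNLOAD_CHECK_MIN)
    gather = worst_case_delay_minutes(hops, GATHER_INTERVAL_MIN)
    print(f"{hops} hop(s): up to {pending:.0f} min for a download, "
          f"up to {gather / 60:.0f} h for new content to be noticed")
```

Under these assumptions, a chain of four relays with no working downstream notifications could add up to 40 minutes per download, and up to 24 hours before the new content is even noticed at the bottom of the chain.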

cscott · September 17, 2024, 1:13pm

We do see delays of several days sometimes (that could be multiples of the 6 hours I guess) and sometimes delays of 1 hour or so, which could be multiples of the 10 minutes… will get this checked.

Don’t see any timeouts in top level relays though…

cscott · September 17, 2024, 1:31pm

Checked comms from root to top level to intermediate relays and back, no issues (all providing HTTP responses) and none of the relays have any timeout errors.

Do I need to enable verbose logging?

JasonWalker · September 17, 2024, 2:40pm

You probably should open a Support ticket and enable verbose logging. Support can help you go through the logs and see where communication is dropping.

They’ll also want to check the client settings on the root, relays, and a client, to see if you have bandwidth throttling enabled somewhere, and you may need to do some performance tests to see whether you have poor links or maybe QoS throttling on your links.