Clients executing task late

darroch · December 11, 2024, 12:10pm

Hello,

I’m seeing inconsistent timing when clients execute tasks scheduled for a specific time. Today I scheduled a task to execute at 07:00 UTC on 228 RHEL servers spread across multiple time zones and virtualisation platforms.

149 servers executed within 1 min (between 07:00:00 and 07:00:59)
33 servers executed within 5 min of the scheduled time.

43 servers executed it within the first hour (before 08:00 UTC)

3 servers very late at:

08:37:32
09:30:05
10:05:23
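
For reference, the breakdown above can be tallied with a few lines of Python. This is only an illustrative sketch: the bucket labels are made up for the example, and the counts and timestamps are copied from the figures above.

```python
from datetime import datetime

SCHEDULED = datetime(2024, 12, 11, 7, 0, 0)  # 07:00 UTC start time from the post
TOTAL = 228

# Counts per bucket as reported above (labels are illustrative).
buckets = {
    "within 1 min": 149,
    "within 5 min": 33,
    "within 1 hour": 43,
    "very late": 3,
}
assert sum(buckets.values()) == TOTAL

# Share of the fleet in each bucket, as a percentage rounded to one decimal.
shares = {label: round(100 * count / TOTAL, 1) for label, count in buckets.items()}

# Delay of each of the three very late servers relative to the scheduled start.
late_times = ["08:37:32", "09:30:05", "10:05:23"]
late_delays = [
    datetime.strptime(f"2024-12-11 {t}", "%Y-%m-%d %H:%M:%S") - SCHEDULED
    for t in late_times
]

for label, pct in shares.items():
    print(f"{label}: {buckets[label]} servers ({pct}%)")
for t, d in zip(late_times, late_delays):
    print(f"{t} UTC -> {d} late")
```

So roughly two thirds of the servers started within the first minute, and the three outliers were between about 1.5 and 3 hours late.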

The task is configured as follows and was submitted at least 18 hours before the scheduled start time.

Execution
This action starts 12/11/2024 7:00:00 AM UTC and ends 12/12/2024 12:00:00 PM UTC.

It will run at any time of day, on any day of the week.

If the action becomes relevant after it has successfully executed, the action will not be reapplied.

If the action fails, it will not be retried.

I am only using the “starts on” constraint, not “run between”.

Is this expected behavior? Am I able to ensure that the servers execute the task closer to the scheduled time?

Many thanks,

DanieleColi · December 11, 2024, 12:50pm

Well, the client continually evaluates its content: at each evaluation cycle, if the action is relevant and the start time has been reached, execution starts; otherwise, another cycle begins. The “delayed” machines probably have long evaluation cycles.
You can check this with the relevance expression (average duration of it, maximum duration of it) of evaluationcycle of client: https://developer.bigfix.com/relevance/reference/evaluation-cycle.html
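
To illustrate why a long evaluation cycle delays the start, here is a rough Python model. It assumes, as a simplification, that the client looks at a given action once per cycle at a fixed interval. The function name and parameters are made up for the example and are not part of BigFix.

```python
def first_start_time(scheduled: int, last_eval: int, cycle_seconds: int) -> int:
    """Return the first evaluation time (in seconds) at or after `scheduled`.

    Simplified model: the action is evaluated at last_eval, last_eval + cycle,
    last_eval + 2 * cycle, ... and can only start on one of those passes.
    """
    if cycle_seconds <= 0:
        raise ValueError("cycle_seconds must be positive")
    if scheduled <= last_eval:
        return last_eval
    remaining = scheduled - last_eval
    passes = -(-remaining // cycle_seconds)  # ceiling division
    return last_eval + passes * cycle_seconds


# Hypothetical numbers: the action was last evaluated 10 s before the start time.
scheduled = 7 * 3600           # 07:00:00 as seconds since midnight
last_eval = scheduled - 10

for cycle in (60, 600, 5400):  # 1 min, 10 min and 90 min cycles
    delay = first_start_time(scheduled, last_eval, cycle) - scheduled
    print(f"cycle {cycle:>5}s -> starts {delay}s late")
```

In this model the worst-case delay is just under one full cycle, so a client whose cycle takes over an hour can easily start an hour late even though it knew about the action well in advance.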

You can shorten the evaluation cycle by reducing the content/sites the client is subscribed to and/or by increasing the client CPU usage: Configuring Client CPU Utilization
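
As a rough illustration of the CPU side: the client alternates between working and sleeping, controlled by the _BESClient_Resource_WorkIdle and _BESClient_Resource_SleepIdle settings (both in milliseconds). The sketch below assumes the commonly documented defaults of 10 and 480; check the article above for the values that apply to your version. The function name is made up for the example.

```python
def cpu_share(work_idle_ms: int, sleep_idle_ms: int) -> float:
    """Approximate fraction of one CPU core the client may use.

    Assumes a simple duty cycle: work for work_idle_ms, then sleep for
    sleep_idle_ms, and repeat.
    """
    return work_idle_ms / (work_idle_ms + sleep_idle_ms)


# Assumed defaults (WorkIdle=10, SleepIdle=480) versus a doubled work slice.
print(f"{cpu_share(10, 480):.1%}")
print(f"{cpu_share(20, 480):.1%}")
```

Under these assumptions the default works out to about 2% of one core, which is why a large amount of subscribed content can stretch a single evaluation cycle considerably on a low-spec server.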

JasonWalker · December 11, 2024, 1:36pm

Also consider that some of those clients may not be receiving the UDP messages informing them of the new action, due to network blocks or the RH Firewall. Check Tip: Troubleshooting Client Responsiveness for more tips on determining whether that is the case, and for workarounds such as opening the traffic, enabling Command Polling, or Persistent Connections.

Jstev · December 11, 2024, 2:56pm

I was also going to ask about the possibility of the UDP packet not being received by the client and relying on command polling to check in periodically.

You may not have looked at it beforehand, but I’d be curious whether the clients that started late were sitting at “Waiting” or “Not Reported” before the scheduled start time. It’s possible that those clients aren’t receiving the UDP packet telling them they have jobs to process, and only processed the job at their normal check-in or when the command poll interval elapsed.

DerrickD · December 11, 2024, 4:24pm

What is most interesting to me is that you submitted this Action 18 hours in advance. That is more than sufficient time for every endpoint to become aware of something to do at a specific time.

After submitting, did every client have an action status of “Waiting”? If so, that means the BES Client on every endpoint knew about the job and should start exactly on time (within seconds). If not, then there is some type of major client communication issue. If they did have a waiting status, then I am wondering if the endpoint was under load and opted to not start the task.

Have you looked at BES client logs?

itsmpro92 · December 11, 2024, 6:55pm

By default, the client only checks for new content once per 24 hours, if Command Polling is not enabled.
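
If that is the case, the lead time matters. Here is a quick sketch of the worst case, assuming the client only learns about an action when it next polls (no UDP notification) and ignoring any evaluation-cycle delay. The function is purely illustrative.

```python
def worst_case_late_seconds(lead_seconds: int, poll_interval_seconds: int) -> int:
    """Worst-case lateness when the client only hears of the action at its next poll.

    The client may have polled just before the action was submitted, so it can
    take up to one full poll interval to learn about the action.
    """
    return max(0, poll_interval_seconds - lead_seconds)


HOUR = 3600
# 18 h lead time (from the original post) against a 24 h and a 1 h interval.
print(worst_case_late_seconds(18 * HOUR, 24 * HOUR) / HOUR)
print(worst_case_late_seconds(18 * HOUR, 1 * HOUR) / HOUR)
```

With an 18-hour lead and a 24-hour interval, a client that never receives the UDP notification could be up to 6 hours late in this model; with a poll interval shorter than the lead time it would already know about the action when the start time arrives.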

darroch · January 17, 2025, 10:26am

Many thanks for all of the responses and apologies for the tardy reply.

With further troubleshooting, it looks like there is a large number (>250) of fixlets that should be analyses (or be disabled/removed completely). This, combined with low CPU allocation for the client (and lower-spec servers), looks to be the issue.

Some of the evaluation loop averages looked very high (>50, some even triple digits).

I have enabled the “Agent Performance Counters” Analysis (source: Tracking evaluation efficiency on BigFix Clients with performance counters - Customer Support)

Some of the clients have what look to be VERY high durations, for example:

_other (max 11days!)
_property (max 2days)
_relavance (max 4days).

The plan is to continue with the housekeeping of the fixlets and old analyses.