Client Evaluation Cycle Is Excessively Long

After our patching cycle last night, we are seeing serious deterioration on some clients. I’ve been monitoring client evaluation cycles for a little while now, and the clients that have shown issues since last night are taking more than 30 minutes per evaluation cycle.

What is the best way to determine which specific properties are taking a long time to evaluate on these clients so we can make them better or grab the information some other way?

See the “slow evals” property in this analysis: http://bigfix.me/analysis/details/2994765

You can also enable client debugging on a client that is having an issue and check the logs. You’ll want to look at “usageprofiler.txt.###.log” in the client folder. ( C:\Program Files (x86)\BigFix Enterprise\BES Client )

I’m glad someone has started using all that tracking that was put in :smile:

Just so you know, when you update the settings for tracking fixlet files, the inspectors also change how many results they return (by default it’s only the top 10 fixlets).

Out of curiosity, is there a client setting that can be deployed that will interrupt an evaluation cycle to run an action if the evaluation takes X number of minutes/seconds within the action’s time constraints? We just had a baseline deployment last night where most computers failed because they didn’t run the action within a 5 hour time constraint. The average eval time of the clients that failed varied from an hour to almost 4 hours.

This also has implications for policy actions. We will be deploying one soon for a security violation that needs to be remediated ASAP once it is discovered, but the computer can take a while to act because its cycle takes so long.

Processing actions should not be stopped by the eval cycle, but even so, those are some very long eval loops.

It sounds like you need to:

  • Enable Command Polling to help the client find actions it doesn’t know about
      • Once an hour for clients that sleep or roam networks regularly
      • Once every 12 hours for all other clients
  • Find out what relevance is causing the eval loop to be so long and do something about it
  • Adjust the minimum analysis interval
  • Increase the max CPU usage to shorten the eval loop
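
As a rough illustration of the polling intervals suggested above, here is a minimal Python sketch that picks an interval per client type. The helper name is hypothetical, and the two setting names (`_BESClient_Comm_CommandPollEnable`, `_BESClient_Comm_CommandPollIntervalSeconds`) are my assumption of the relevant client settings, so check them against the BigFix documentation before deploying anything.

```python
# Hypothetical helper: the intervals come from the list above; the setting
# names are assumptions and should be verified against the BigFix docs.
SECONDS_PER_HOUR = 3600


def command_poll_settings(sleeps_or_roams: bool) -> dict:
    """Return suggested command polling settings for one class of client."""
    # Once an hour for clients that sleep or roam, every 12 hours otherwise.
    interval_hours = 1 if sleeps_or_roams else 12
    return {
        "_BESClient_Comm_CommandPollEnable": "1",
        "_BESClient_Comm_CommandPollIntervalSeconds": str(
            interval_hours * SECONDS_PER_HOUR
        ),
    }


if __name__ == "__main__":
    print("laptops:", command_poll_settings(sleeps_or_roams=True))
    print("servers:", command_poll_settings(sleeps_or_roams=False))
```

The values are returned as strings because client settings are stored as text; how you push them out (a task, a policy action, etc.) is up to you.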

If clients are asleep when you create the baseline deployment actions, then they will not know about them, and they won’t re-gather to find out about them until their gather interval, which is 24 hours by default. This is where Command Polling comes in, so that the client will check for missed UDP messages / commands. I would be surprised if this happened with command polling enabled.

To be clear, the cycle isn’t necessarily stopping actions but preventing them from starting. For example, the policy action I commented about above, in testing, only worked immediately after a restart of a client that had an average 2 hour evaluation cycle. Then we would change the security finding to be non-compliant and wait for it to remediate. Once it was in the cycle, it didn’t deploy the action.

Don’t know how I didn’t notice this before but this is very useful!

Thanks jgstew!

If the client is not receiving “new content” notifications, it will not be aware of the new action until it checks in at its 24-hour gather interval, as @jgstew notes.

In addition to the clients being offline/asleep, a client could fail to receive notifications if a network or host-based firewall is blocking inbound packets on 52311/udp, if another application has locked port 52311/udp (likely a DNS server; see the Fixlet in the BES Support site to reserve port 52311), or (possibly?) if the relay cannot determine the client’s Registration Address because the client has more than three IP addresses assigned or there is a NAT device between the client and relay.

If any of those are the case, you could try to change the configuration so the clients can receive the 52311/udp notification, or increase the command polling frequency from the client.

Well the policy action is already downloaded and part of the content being evaluated isn’t it? So the action would be a part of that and, when it comes up as relevant, should run. However the cycle has to finish for the action to start. I guess I’ll need to figure out the best way to bring that cycle average down for us.

Are you saying the action isn’t becoming relevant?

When a new action (or any new fixlet) is received by the client, its relevance is one of the first things checked, since we assume new content is what people are most interested in. So if it is relevant, it should be placed into the action pool. If it is pending a time period, it will be constrained until then. If you didn’t set the downloads to occur before the time period, it won’t start downloading them until that point, which might be the issue.

I see, at least, the name of the fixlet becoming relevant in the client log. I guess that means the client then needs to send that back to the core which then sends the action to the client? I was under the impression that, in this regard, a policy action would not need to check in to start.

The action itself simply rewrites lines in a configuration file so it’s not dependent on any downloads and there are no time constraints. The idea was that when this becomes relevant, it deploys the fix relatively immediately.

A policy action will run completely independently, but it is not sufficient for the client to know it is relevant to the fixlet. It must know it is relevant to the action based upon the fixlet.

If you didn’t take the policy action until recently, then it will take up to its gather interval (24 hours by default) for the client to know about it unless it gets the UDP notification or polls for commands.

Once the client knows that there is a new action, it will gather it, determine whether it should run it, and then run it immediately. If the UDP notifications are working properly, this whole process should take about 60 seconds at most, and often as little as 15 seconds in my experience. This should happen independently of the eval loop. The exception is when the client is in the middle of a long relevance evaluation: it will then wait until its interrupt time before stopping evaluation, switch to actions, and then go back to its eval loop.

I assume you have used this? I’m having a problem understanding if the results are “good” or not. For example:

Eval Cycle Max = 145 (is that milliseconds?)
CommandPolling = Disabled (self explanatory)
Eval Loop Avg = 00:29:14.799356 (29 minutes?)
Slow Evals = This analysis property 58 :wink:
Any other items to keep an eye on?
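
For what it’s worth, if the Eval Loop Avg value is a duration in `HH:MM:SS.ffffff` form (an assumption based purely on how it looks above, not on the analysis definition), it can be converted to minutes with a few lines of Python. The function name is just for illustration.

```python
from datetime import timedelta


def parse_eval_loop_avg(value: str) -> timedelta:
    """Parse a duration string assumed to be in HH:MM:SS.ffffff form."""
    hours, minutes, seconds = value.split(":")
    return timedelta(
        hours=int(hours), minutes=int(minutes), seconds=float(seconds)
    )


avg = parse_eval_loop_avg("00:29:14.799356")
# Prints the average in minutes, which is a little over 29 for this value.
print(f"{avg.total_seconds() / 60:.1f} minutes")
```

Read that way, 00:29:14.799356 is 29 minutes and about 15 seconds per loop.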