Slow processing of content

Hi All,

Looking for some suggestions on what I can look at.

I have a policy action that is set to run every 15 minutes. Now, I do not expect it to fire exactly every 15 minutes, but I would hope for every 30 to 60 minutes. All this policy does is check the value of a registry key and, depending on that value, set another key. When the action does run, it completes very quickly, so no problem there.

The biggest problem is that this only happens on a few systems; most of my environment runs as I would expect. I thought this might be an issue with load on the server, but the CPU sits at about 5%, with a peak of maybe 40% every few hours. I then thought it might be a hypervisor issue, since the system I was checking is a VM, but other VMs on the same hypervisor did not have this issue. On top of that, I found a physical server with 12 processors and 16GB RAM that has the issue. That server is actually a DR server, so it had no load on it at all.

I have enabled the Usage Profiler on it and I can see that the slower system is, for sure…slow :slight_smile: When I compare the Usage Profiler logs between a normal system and the slow system, they identify many of the same top 10 items, but the evaluation times are just way longer. Also, the sample count on the slow system is about 3,000-5,000, while on a normal system it is about 50,000-60,000.

Also to note, both the normal and slow systems evaluate all the same content, including custom baselines, analyses, and fixlets/tasks.

Here is an example of the time difference between two systems.
Normal server:
Mon, 26 Nov 2018 08:59:33 -0700 Complete file Enterprise Security/2014 Security Bulletins (Apps).fxf: 121381 microseconds

Slow server:
Tue, 20 Nov 2018 04:52:48 -0700 Complete file Enterprise Security/2014 Security Bulletins (Apps).fxf: 4244516 microseconds
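For anyone comparing Usage Profiler logs the same way, here is a quick way to quantify the gap. This is a small Python sketch of my own; the helper name and regex are mine and are based only on the "Complete file …: N microseconds" line format quoted above, so adjust if your log lines differ:

```python
import re

# Parse a Usage Profiler "Complete file" line into (file name, microseconds).
# The regex assumes the line format shown above; it is not an official spec.
LINE_RE = re.compile(r"Complete file (?P<name>.+?): (?P<us>\d+) microseconds")

def eval_time_us(line):
    m = LINE_RE.search(line)
    return (m.group("name"), int(m.group("us"))) if m else None

normal = ("Mon, 26 Nov 2018 08:59:33 -0700 Complete file "
          "Enterprise Security/2014 Security Bulletins (Apps).fxf: 121381 microseconds")
slow = ("Tue, 20 Nov 2018 04:52:48 -0700 Complete file "
        "Enterprise Security/2014 Security Bulletins (Apps).fxf: 4244516 microseconds")

_, n_us = eval_time_us(normal)
_, s_us = eval_time_us(slow)
print(f"slow/normal ratio: {s_us / n_us:.1f}x")  # ~35x for the lines above
```

Running both logs through something like this makes it easy to see whether the slowdown is uniform across content or concentrated in a few files.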

I know that we have 2 baselines with about 125 components each, and I am going to work with my server team to clean those up, but those baselines are processed by both the normal and slow systems.

On one of the slow systems, I did increase the client CPU limit to 10% and it made a bit of a difference, but it is still nowhere near a normal system.

I do have a PMR open already, but we have been on it for a couple weeks and I need to get some other ideas to test.

I can share the policy action if needed, but I am not sure it will point to anything, as it is not an issue on about 95% of our systems.

Thanks

Martin

I opened a PMR on this issue and we are still working on it. One thing of interest: I re-read the http://www-01.ibm.com/support/docview.wss?uid=swg21505852 article and noticed the note at the bottom, which reads:

“All of these calculations apply to a single processor, so if you have multiple processors the overall % of agent CPU is reduced significantly because it is divided by the number of processors. For example, if you want your agent to use less than 5% of CPU and it has 4 processors, you must set workidle to 100 and sleepidle to 400 because [100 / (100 + 400)] / 4 = 5%.”
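To sanity-check that note, here is a small Python sketch of the same math (the function name is mine; the formula is straight from the article). It also shows why a 12-CPU box at the default workidle/sleepidle of 10/480 ends up with almost no agent CPU:

```python
def agent_cpu_percent(workidle, sleepidle, num_cpus):
    """Overall agent CPU %, per the article's note:
    (workidle / (workidle + sleepidle)) / num_cpus, as a percentage."""
    return 100.0 * workidle / (workidle + sleepidle) / num_cpus

# The article's 4-CPU example: workidle=100, sleepidle=400
print(agent_cpu_percent(100, 400, 4))   # 5.0

# The default 10/480 on a 12-CPU box: only ~0.17% of total CPU
print(agent_cpu_percent(10, 480, 12))
```

That second number lines up with what I was seeing: on many-core hardware, the defaults leave the agent with a tiny slice of overall CPU.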

I was testing on a server with 12 CPUs and saw in the Usage Profiler logs that I was getting about 10,000 samples per hour, compared to a “normally functioning” system that gets about 50,000 per hour. Both are subscribed to the same content and are the same OS (Win2008 R2).

Looking at the 12-CPU system, I noticed that Task Manager almost never showed the BESClient using any CPU; I never saw it go above 1%. Total server CPU utilization is typically less than 3%. It is also a physical server with the BESClient installed on a local drive, so no issues with VMware or SAN.

After reading the above article, I set workidle to 300 and sleepidle to 400, using the formula in the article to target roughly 3% on a 12-CPU system. After setting this and restarting the BESClient, I now see the BESClient using 3-4% of the CPU, and the Usage Profiler is processing 300,000 samples.
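In case it helps anyone else, the article's formula is easy to run in reverse: pick a target overall percentage and solve for workidle at a given sleepidle. A hedged Python sketch (the function name is mine); note that by the same formula my 300/400 on 12 CPUs actually works out to about 3.6%, which matches the 3-4% I observed:

```python
def workidle_for_target(target_pct, sleepidle, num_cpus):
    """Invert the article's formula: choose workidle so that
    workidle / (workidle + sleepidle) / num_cpus == target_pct / 100."""
    frac = target_pct / 100 * num_cpus   # fraction of one core spent working
    return frac * sleepidle / (1 - frac)

# Reproduce the article's 4-CPU / 5% example:
print(workidle_for_target(5.0, 400, 4))    # 100.0

# Exactly 3% overall on a 12-CPU box with sleepidle 400:
print(workidle_for_target(3.0, 400, 12))   # ~225
```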

Now, I do not need it to be processing that many samples, and I really do not need 3% across 12 processors, but at least it is responding to the policy action as expected.

Has anyone else seen this occur? Any tips on what you do in this situation?


What was the sleepidle and workidle on the systems that had this issue?

It could be that the values were a bit too low if a single core of the CPU is too slow. I still wouldn’t increase the CPU usage quite as significantly as mentioned in that article.

The other thing that I recommend is turning on “Power Save,” which causes the BigFix agent to use 0% CPU for 10 minutes at a time, then wake up, do an evaluation loop, and go back to sleep. This drives the client's average CPU usage way down and saves energy, while also letting you set the client's CPU limit higher, like 5% to 10% of a single CPU core, so the evaluation loop runs in bursts. You can end up with the evaluation loop completing much faster AND the client's overall CPU usage being lower. Also, modern processors save energy by going into deep idle states, so this configuration greatly benefits the battery life of laptops.
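To put rough numbers on that trade-off, here is a back-of-the-envelope Python sketch. The burst percentage and loop duration below are made-up illustrations, not measurements from this thread; the only thing taken from the description above is the 10-minute sleep interval:

```python
def powersave_avg_cpu(burst_cpu_pct, eval_minutes, sleep_minutes=10.0):
    """Rough average CPU with Power Save: the agent idles at ~0% for
    sleep_minutes, then runs one evaluation loop at burst_cpu_pct."""
    return burst_cpu_pct * eval_minutes / (eval_minutes + sleep_minutes)

# e.g. a hypothetical 2-minute loop at 10% of one core, then 10 minutes asleep:
print(powersave_avg_cpu(10.0, 2.0))   # ~1.7% average
```

So a burstier configuration can finish each loop faster while still averaging well under the steady-state limit you might otherwise set.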

The work and sleep were at the default 10/480.

For the most part, systems seem to be OK. I do have some single- and dual-processor systems that are slow, but what I am finding is that they are already busy, so it makes sense that their responsiveness is not as good. I can live with that, since I understand the root cause.

I pulled together a summary of the systems, and it seems that Win2012+ does not have this issue no matter how many CPUs. Some of the slower Win2012 systems seem to be slow because the CPU load is higher (randomly checked in SCOM).
Win2008 seems to be OK with up to 6-8 CPUs, but with 12 or more there is a significant loss of performance.

I can take a look at Power Save and see what happens, but I wonder if this is a Win2008 issue.

Thanks for the feedback.

When doing the work/sleep calculations, you definitely need to think in terms of a single processor. The numbers will look lower when there are more cores/threads.

The other thing to look at, if you expect something to run on a schedule, is your evaluation cycle time. See https://developer.bigfix.com/relevance/reference/evaluation-cycle.html#average-duration-of-evaluation-cycle-time-interval