VM Disk IO BES Client service times out

(imported topic written by olsonc5891)

We have a ton of VMs in our environment, and the disk I/O is absolutely horrendous. Deploying patches to the entire environment, even staggered over several hours, doesn’t help much. Any advice on how to handle that (besides fixing our storage solution, which we are working on)?

It is so bad that after the patch deployment and reboot, the machines respond so poorly that the BES Client service times out and fails to start. Several other services fail as well. Of course, I can’t use BigFix to start these services because the machines are not reporting. Is there a way to force a start of the BES Client service on computers that are not responding because the service is not running? Currently, I am logging on to each server individually to start the BES Client service. There goes my weekend!

Any advice would be appreciated.

Thanks,

Chris

(imported comment written by wnolan91)

To get your BESClients up and running… create a quick batch file to run…

SC \\{Computer1} start BESClient

SC \\{Computer2} start BESClient

SC \\{Computer300} start BESClient

This should be fairly easy in Excel: with a list of machine names in column A, put this in column B: =CONCATENATE("SC \\",A1," start BESClient")
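If Excel isn’t handy, the same list of commands can be generated with a short script. This is only a minimal Python sketch, not a tested tool: the function name build_start_commands and the example machine names are hypothetical, and it assumes the service is named BESClient as above and that sc takes the remote machine in \\Computer form.

```python
def build_start_commands(hosts, service="BESClient"):
    """Return one sc start command line per non-empty host name."""
    commands = []
    for host in hosts:
        # Tolerate stray whitespace and names that already carry leading backslashes.
        name = host.strip().lstrip("\\")
        if name:
            commands.append(f"sc \\\\{name} start {service}")
    return commands


if __name__ == "__main__":
    # Hypothetical machine names; replace with your own list.
    hosts = ["Computer1", "Computer2", "Computer300"]
    for line in build_start_commands(hosts):
        print(line)
```

Redirect the output to a .bat file and run it from an account that is allowed to control services on the target machines.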

As for tuning the VM environment, there are some new settings in BigFix v8.x that will help a lot, such as “DeepSleepMode”.

Client settings to tune it for virtual infrastructure:

http://support.bigfix.com/cgi-bin/kbdirect.pl?id=1045

Also, tune the ‘Idle’ CPU usage, which is the normal CPU usage while the system is inactive:

http://support.bigfix.com/cgi-bin/kbdirect.pl?id=247

(imported comment written by olsonc5891)

Thanks for the information. On the VMs, I am having trouble understanding where to apply the configuration change:

"Servers running multiple virtual systems (such an ESX Server running multiple VMWare images) have less resources available for background tasks and so it is useful to decrease the resource usage of the BigFix Agents running in the virtual computers. Decreasing the resource usage will make the BigFix Agents run somewhat slower and respond to actions slower than BigFix Agents running with the default resource usage settings.

Please make the following configuration changes to all BES Clients running on virtual servers that host multiple images. These configurations will help to keep resource usage low for the server."

Does this mean to apply this only to ESX Hosts that have multiple VMware machines in them or do we need to change the configuration on all the VMware machines in the host?

I guess the question is: does an “image” mean a VMware machine?

I reviewed a couple of different VMware glossaries and the term “image” is not in them. I just want to be sure.

Thanks,

Chris

(imported comment written by BenKus)

Hi Chris,

You can change the client setting on any VM machines (and we sometimes say “images” instead of “machines”) that are on a shared host. Changing these client settings on the host computer is fine too, but less important.

If you set these settings, it will likely fix your problem situation because the root cause is probably that all the agents are working at the same time to respond to your new patch action. You should also enable “temporal distribution” for your actions (maybe for 60 minutes?) to help spread the load of the actual patch installations.

Ben

(imported comment written by olsonc5891)

Thanks Ben. One last question. If I enable “temporal distribution” for 60 minutes, does that mean it will patch one server every 60 minutes? That doesn’t sound right. Please define “temporal distribution”.

Lastly, I tried to get a description for the client settings from the support pages. The ones below are not listed in the article and a search doesn’t give me anything. There are two similar ones called SleepIdle and WorkIdle, but the values don’t match.

Set the _BESClient_Resource_SleepNormal client setting to value 480 (Default 1).

Set the _BESClient_Resource_WorkNormal client setting to value 10 (Default 20).

I did put a checkmark in Distribute over “5” minutes to reduce network load when creating the action.

Thank you,

CO

(imported comment written by wnolan91)

“Temporal distribution”: you have to understand that everything runs from the point of view of the client. Temporal distribution makes each client randomly pick a time, within the window you specify, at which it downloads and processes the action.

Say you send an action today with a 24-hour temporal distribution. Normally, all clients that accept UDP would download the action at the time it was created. **** This is what they are trying to prevent ****, as it wakes up all the clients at around the same time to download the action file from the relays. With temporal distribution, each client gets the action and then randomly picks a time between “NOW” on that computer and 24 hours later, in my example.

If you keep this action open for a long period of time, let’s say a month, and two weeks later add another computer to the environment, that computer will do the same thing, so it may take up to 24 hours before it acts on the action. Don’t assume that all machines will be patched within a single 24-hour period just because you set a temporal distribution of 24 hours: “NOW” is whenever the client processes the action file.
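To make the “NOW plus a random delay” idea concrete, here is a small Python sketch. It is only a model of the behaviour described above, not BigFix code: the function name pick_start_time is made up, times are plain seconds, and a uniform random delay is an assumption.

```python
import random


def pick_start_time(received_at, window_seconds, rng=random):
    """Return when a client starts: its own NOW plus a random delay in the window."""
    return received_at + rng.uniform(0, window_seconds)


if __name__ == "__main__":
    day = 24 * 60 * 60
    rng = random.Random(0)  # seeded only so repeated runs match

    # Three clients that see the action as soon as it is created (time 0).
    for client in ("vm-a", "vm-b", "vm-c"):
        start = pick_start_time(0, day, rng)
        print(f"{client} starts {start / 3600:.1f} hours after the action was created")

    # A machine added two weeks later gets its own 24-hour window from its own NOW.
    late = pick_start_time(14 * day, day, rng)
    print(f"late machine starts on day {late / day:.2f}")
```

The late machine always lands somewhere in day 14 to 15, which is why a 24-hour distribution does not mean every machine is patched within the first 24 hours.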

I’ll let Ben try to explain the difference between WorkIdle/WorkNormal and SleepIdle/SleepNormal. I just know that the “Normal” settings decreased I/O by about 14% on our VMs during normal client usage.
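As a rough illustration only: if the Work/Sleep values are read as milliseconds of work followed by milliseconds of sleep in each cycle (an assumption on my part, not something confirmed in this thread), the fraction of time the client is busy can be estimated like this. The function name duty_cycle is made up for the example.

```python
def duty_cycle(work_ms, sleep_ms):
    """Fraction of each work/sleep cycle spent working (assumed interpretation)."""
    total = work_ms + sleep_ms
    if total <= 0:
        raise ValueError("work_ms + sleep_ms must be positive")
    return work_ms / total


if __name__ == "__main__":
    # Values quoted earlier in the thread for WorkNormal / SleepNormal.
    print(f"defaults (work 20, sleep 1):  {duty_cycle(20, 1):.1%}")
    print(f"VM tuning (work 10, sleep 480): {duty_cycle(10, 480):.1%}")
```

Under that reading, the tuned values drop the client from roughly 95% to roughly 2% of each cycle, which would fit with it running slower but being much lighter on a shared host.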

One other thing that they don’t really cover too much is that you shouldn’t use the “Master Operator” account to send actions. This may have been fixed but I don’t remember seeing anything related to this.