BigFix Agent Hung State - Cause: Disk Corruption

(imported topic written by SystemAdmin)

Over time we have seen BigFix client services that are in the “started” state but whose logs stop at a certain point in time. When we then try to restart the service, the restart fails. A reboot usually fixes the problem temporarily, but we have found something interesting: on about 20 occasions, each time the BigFix agent hung like this there was some sort of disk corruption. Is there any way BigFix can identify this and work around it (maybe the Helper Service)? The problem is that the machines stop reporting in, and we delete them from BigFix. We only hear about them after they become problems on the network. The permanent fix is to run chkdsk c: /F /R on these machines.

Here is a slice of what we normally see on machines where this is happening, after running chkdsk in analysis mode:

5030 data files processed.

Correcting errors in the Volume Bitmap.

Windows found problems with the file system.

Run CHKDSK with the /F (fix) option to correct these.
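For anyone who wants to check for this across many machines, here is a minimal Python sketch that scans captured chkdsk output for the messages quoted above. The function name and the choice to match on these exact English strings are assumptions for illustration; other Windows versions or localized builds may word the messages differently.

```python
# Hypothetical helper: flags chkdsk output containing the messages quoted above.
# The marker strings come from the output in this post; other Windows versions
# or display languages may phrase them differently.
PROBLEM_MARKERS = (
    "Windows found problems with the file system",
    "Correcting errors in the Volume Bitmap",
)


def chkdsk_reports_problems(output: str) -> bool:
    """Return True if any known problem marker appears in the chkdsk output."""
    lowered = output.lower()
    return any(marker.lower() in lowered for marker in PROBLEM_MARKERS)
```

The text passed in would be whatever chkdsk printed in analysis (read-only) mode; how that output is collected is left open here.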

(imported comment written by BenKus)

Hey Kevin,

That sounds very bad if the computers are experiencing disk corruption… The BigFix Agent tries to tolerate errors as much as possible, but it does have a fundamental assumption that the disk system is functional… I don’t think I have heard about issues like this at other customers. Do you know if maybe you guys had some defective HD models?

Also, what version are you running? There was a known and rare issue in the Windows file system that sometimes broke our agent and we worked around this in 7.2 agents (Microsoft has confirmed but hasn’t fixed the issue).

To answer your question, we do have a helper service that you can deploy through the BigFix Support Fixlet site called “Install BES Client Helper Service”:

"This task will install the BES Client Helper Service on the selected computers.

The BES Client Helper Service will periodically check on the BES Client service and attempt to restart it if the service is stopped. It will also attempt to perform a number of troubleshooting steps if the service is not starting up correctly.

The default behavior for the BES Client Helper Service will be to check the status of the BES Client once a day."

You can choose to deploy this to all your computers, but it is better to deploy to specific suspicious computers to help identify problems.

Ben

(imported comment written by SystemAdmin)

Hello Ben,

  • This happens on all different types of machines; the key link is that there is always disk corruption on these machines.
  • We are on the latest version of the BigFix client.
  • We are running the Helper Service on all Windows machines, but this doesn’t help the problem.

I am working with support to see if we can find more details, but it is tough to find machines that we have access to that have this behavior, since our environment is so large.

I really want to post this to see if other customers are seeing this same problem and help them determine if the disk corruption is causing these agents to stop communicating.

Kevin

(imported comment written by SystemAdmin)

We are experiencing similar issues. When I run a chkdsk, most of the time the reported issues are only in the __BESData folder structure. Doing a chkdsk /f usually fixes this right up, but sometimes we have to delete the __BESData folder and let it regenerate.

Our deployment is ~5500 nodes and while I don’t have exact numbers, we seem to catch several of these situations a week. It is getting to the point where I am about to force our field machines to run chkdsk on every boot automatically.

We also have the BESClientHelper service installed, but it doesn’t appear to help in these situations. All of our Windows agents are the latest version as well (7.2.4.60).

(imported comment written by BenKus)

Hey guys,

We are not aware of this issue… if you have a computer in this state, we would like to work with you to identify the issue of the hung agent and also see if we can assist in figuring out the file issues…

Ben

(imported comment written by rmnetops91)

We have seen this issue on a few machines as well. I was actually searching for this issue. I will check the system for any disk issues to see if it’s related.

(imported comment written by rdamours91)

Is it possible that the disk errors are from running the PCs in IDE compatibility mode when they have SATA drives?

We had an issue where the SATA disk drivers were not part of the XP image, and the workaround from the guy who created the image was to put the machines into IDE compatibility mode in the BIOS. This caused no end of performance and disk issues.

(imported comment written by SystemAdmin)

We see this on a regular basis, with BigFix agents (and sometimes relays) becoming corrupt. Usually the cure is to clear the __BESdata folder and let it rebuild itself. Depending on the nature of the corruption, you may not even be able to delete the bad file or folder. In that case, move it to a junk folder and let the agent rebuild the dynamic data.

(imported comment written by rmnetops91)

So what is the best way to monitor for agents where this occurs?
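One possible starting point, based on the symptom described at the top of the thread (service “started” but the log no longer advancing), is to check how old the newest client log file is. This is only a sketch: the function names and the six-hour threshold are made up for illustration, and the log directory is passed in rather than hard-coded because its location (assumed to be a Logs folder somewhere under __BESData) should be confirmed on your own installs.

```python
import time
from pathlib import Path
from typing import Optional


def newest_log_age_seconds(log_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the most recently modified *.log file.

    Returns None when the directory contains no *.log files at all.
    """
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in log_dir.glob("*.log") if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def agent_looks_hung(log_dir: Path, max_age_seconds: float = 6 * 3600,
                     now: Optional[float] = None) -> bool:
    """Treat a missing or stale log as a sign that the agent may be hung.

    The 6-hour default is an arbitrary assumption, not a BigFix recommendation.
    """
    age = newest_log_age_seconds(log_dir, now)
    return age is None or age > max_age_seconds
```

A stale log on its own does not prove a hang (the machine may simply have been powered off), so this would need to be combined with a check that the service itself reports as running.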

(imported comment written by js_tom91)

How can I remove the __BESData folder through the BES console?

Can we do it through some applicable “tasks”?

(imported comment written by SystemAdmin)

You can’t remove __BESdata on clients directly from the console. There are workarounds, however.

Data corruption in the __BESdata folder is a common but tricky issue, and it can manifest in multiple ways. If, say, only a specific custom site is corrupt, then updates to that site or actions from that site are ignored, but other sites work fine. For that case, I’ve made a custom task in the master action site that schedules a local task to stop the agent, create a randomly named “junk” folder (random in case the same task needs to run again in the future), and MOVE (not copy) the __BESdata folder to that “junk” folder. The client will still run the action from the master action site, which isn’t corrupt.

Sometimes that __BESdata folder is so corrupt that the agent won’t start. The best cure for that is to be preemptive. A daily local scheduled task searches for the BESClient as a running process. If found, the script ends. If the agent isn’t running, a second script is invoked that follows the same process described above and MOVES the corrupt __BESdata folder to a junk location, then restarts the agent. This approach has successfully auto-remediated many machines for us.

We initially thought BESClientHelper was going to do that, but it seems to be limited in what it does. It would be wonderful if it were updated with some intelligence to deal with the corruption issue rather than just blindly restarting the agent.

BTW, a similar process also works for corrupt relay metadata.
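To make the scheduled-task approach above a bit more concrete, here is a rough Python sketch of the same flow: if the agent is running, do nothing; otherwise move (not copy) __BESData into a uniquely named junk folder and start the agent again. This is not the actual script described in this post. The function names are hypothetical, and `is_running` and `start_agent` are placeholders for however you check and start the BES Client service in your environment.

```python
import shutil
import uuid
from pathlib import Path
from typing import Callable, Optional


def quarantine_besdata(client_dir: Path, junk_root: Path) -> Optional[Path]:
    """Move client_dir/__BESData into a uniquely named folder under junk_root.

    Returns the new location, or None if there was no __BESData folder to move.
    """
    besdata = client_dir / "__BESData"
    if not besdata.exists():
        return None
    # A random name lets the same job run again later without colliding.
    target = junk_root / f"junk_{uuid.uuid4().hex}"
    target.mkdir(parents=True)
    moved = target / "__BESData"
    shutil.move(str(besdata), str(moved))
    return moved


def remediate_if_stopped(client_dir: Path, junk_root: Path,
                         is_running: Callable[[], bool],
                         start_agent: Callable[[], None]) -> bool:
    """Run the remediation only when the agent is not running.

    Returns True if remediation was attempted, False if the agent was running.
    """
    if is_running():
        return False
    quarantine_besdata(client_dir, junk_root)
    start_agent()
    return True
```

Passing the process check and the service start in as callables keeps the folder handling testable on its own; on a real machine they would wrap whatever commands you already use for that. Note that this covers only the “agent not running” case, not the hung-but-started case from the top of the thread.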

(imported comment written by SystemAdmin)

This doesn’t sound like a TEM/BigFix problem; it sounds more like hardware failures, exacerbated by TEM/BigFix. Too many rewrites on the same sectors causing increased failure rates? I wonder how widespread this problem truly is. With 10,000+ current clients, and something in the range of a total of 40,000+ clients in the planning stages, even a 0.1% yearly occurrence of this kind of problem would be a pain for us.

That said, it would be nice if the BESClientHelper service could perform some form of validation on the __BESdata folder when it finds a stopped BESClient. Without that function, it sounds like a scheduled job to perform an occasional, but thorough, ChkDsk would be a good idea (maybe only when the BES client has failed, like JonL does). Maybe monthly, spread across machines so they don’t all do it at the same time. It would also be nice for the BESClientHelper service to be able to report, out of band as it were, when it finds clients that have crashed. Maybe via SMTP, or even SNMP, or SysLog.
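As a sketch of the “spread across machines” idea, each machine could derive its own day of the month from a hash of its hostname, so the schedule is stable per machine but scattered across the fleet. The function name is hypothetical, and 28 is used only so that the chosen day exists in every month.

```python
import hashlib


def chkdsk_day_of_month(hostname: str, days: int = 28) -> int:
    """Map a hostname to a stable day of the month in the range 1..days."""
    # Normalize so "WKS-0001" and "wks-0001" land on the same day.
    normalized = hostname.strip().lower().encode("utf-8")
    digest = hashlib.sha256(normalized).digest()
    return int.from_bytes(digest[:8], "big") % days + 1
```

A local scheduled job would then compare today’s day of the month with this value and only run the check on a match.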

Is this something that people are only seeing on Workstations and laptops, or has anyone seen it happen on a server? I’m wondering if higher quality drive subsystems on servers show the problem or not?

A document from Microsoft (intended for home users, but …) : http://www.microsoft.com/athome/setup/maintenance.aspx

(imported comment written by SystemAdmin)

I have seen it on both workstations and servers, and running a chkdsk will always return this message “Windows found problems with the file system.”

(imported comment written by SystemAdmin)

I’ve seen it on both client and server machines, but to a much smaller extent on servers. It also varies by OS. WEPOS is by far the worst OS for disk corruption in our organization.

The independently scheduled tasks I mentioned previously seem to be the easiest approach to automatically deal with most of the issues. There will always be a few that will require manual intervention.

IBM: Is there any hope of getting an updated BESClientHelper that can perform these steps automatically?

(imported comment written by SystemAdmin)

I’m sorry, but what is WEPOS?

(imported comment written by NoahSalzman)

Windows Embedded Point-of-Sale.

http://www.microsoft.com/windowsembedded/en-us/evaluate/windows-embedded-pos-ready.aspx

(imported comment written by SystemAdmin)

Is there any common element to the machines that exhibit the problem? For example, is FAT32 more prone to the issue than NTFS?

(imported comment written by SystemAdmin)

Hi all,

This sounds very similar to a recent (still ongoing) situation that I have been experiencing on my Unix servers. The agent appears to just randomly stop processing. Occasionally, sending the agent a ‘Refresh’ kicks it back into life. Mostly, though, I have to stop and restart the agent, sometimes having to delete the __BESData directory as well.

I did have a call open, but between the support desk and some excellent help from Jeff Sexton and Doug Coburn, we couldn’t nail down exactly what was causing this.

After reading this thread, I’m wondering whether to set up a batch job to run after midnight to stop the agent, archive yesterday’s log, remove the __BESData directory and restart the agent. Hmmmm???

Mark

(imported comment written by SystemAdmin)

AlanM, all of our Windows machines run NTFS. We see frequent corruption issues on WEPOS and occasional issues on other OSes.

cidermark, doing a daily removal of the __BESdata folder would be the shotgun approach to resolving this. We shouldn’t need to resort to extremes like this to keep the agents going.

IBM: A more resilient agent process that is more self-healing and/or a better BESClientHelper would be very welcome.

(imported comment written by SystemAdmin)

cidermark

The agent appears to just randomly stop processing. Occasionally, sending the agent a ‘Refresh’ kicks it back into life. Mostly, though, I have to stop and restart the agent

Not sure if this is the same problem, but we’ve seen this as well, mostly on Win7, OS X and a few WinXP systems. We have not experienced the corruption issue, though (although I have had it happen at a previous employer).