BES Client won't start

(imported topic written by aldridged91)

After upgrading to v7, I have noticed that on several PCs (random locations) the BES Client will load but will not run. It immediately enters the stopped state with the error message “The BES Client service terminated with service-specific error 7 (0x7).”

I have removed the client, rebooted, and reloaded the client to no avail.

Any help would be greatly appreciated.

(imported comment written by SystemAdmin)

Could you open a support ticket on this? We’ll want to collect some troubleshooting information with you and will probably look at:

  1. The crash dump file if it exists

  2. BES Client EMSG logging if we can create it

  3. Whether all of these machines run the same OS

(imported comment written by aldridged91)

I sent an email with the description of the problem and the error to get a ticket started. Sorry for the delay; I have been working with Weylan to resolve some other issues since v7 came out.

(imported comment written by bdoellefeld)

I had the same issue happening on my clients. I opened a ticket and was given a method to resolve this. I wish the previous poster had come back and posted the fix, since I found this post before I opened the ticket.

Here is the solution I was given.

"The problem you are having can be solved by the following: Go to the path C:\Program Files\BigFix Enterprise\BES Client\__BESData\__Global and delete the file __revocation. After this has been done, restart the service."

When asked for a root cause of the issue, here is the response.

“The _revocation file essentially carries revocation information in case a specific administrator is not allowed to administer a client anymore. Technically, it is a bug causing this issue, and we have it fixed in the next revision of 7.0.”
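For anyone who has to apply the fix quoted above to more than a handful of machines, here is a minimal Python sketch of the same steps: stop the service, delete the file, start the service. The default install path, the `BESClient` service name, and the `__revocations` file name are assumptions pieced together from this thread rather than from official documentation, and `reset_revocations` is a hypothetical helper, so check all of them against an affected machine first.

```python
import subprocess
from pathlib import Path

# Assumptions taken from this thread, not from official documentation:
# the default install directory and the "BESClient" service name.
CLIENT_DIR = Path(r"C:\Program Files\BigFix Enterprise\BES Client")
SERVICE_NAME = "BESClient"


def run_command(args):
    """Run a command (Windows 'net' is expected) and return its exit code."""
    return subprocess.run(args, check=False).returncode


def reset_revocations(client_dir=CLIENT_DIR, service=SERVICE_NAME, run=run_command):
    """Stop the service, delete the revocations file if present, start the service.

    Returns True if a file was deleted. 'run' is injectable so the logic can
    be exercised without touching a real service.
    """
    # Assumed location: <client dir>\__BESData\__Global\__revocations
    target = Path(client_dir) / "__BESData" / "__Global" / "__revocations"
    run(["net", "stop", service])  # fails harmlessly if already stopped
    deleted = target.is_file()
    if deleted:
        target.unlink()
    run(["net", "start", service])
    return deleted
```

Run from an elevated prompt, `reset_revocations()` with no arguments would act on the default path; passing a different `client_dir` covers non-default installs.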

(imported comment written by BenKus)

Hi bdoellefeld,

To be clear on this issue, the __revocations file was invalid in this particular instance (it was filled with all 0s for some reason). We are not quite sure why it was corrupted in this form, but it seems it could be caused by multiple issues… a leading theory at the moment is that perhaps a hard power shut-down at the right moment or a hardware failure could have caused this… alternately it might be some rare unknown bug in the BES Client (past or present). This appears to happen only rarely so it is hard to know for sure.

In any case, the agent shouldn’t refuse to start if the file is corrupted and so that is the bug that we will fix for the next version (which basically addresses the issue for everyone except that we still want to know the root cause of the issue if we can).
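To tell this particular corruption apart from other causes before deleting anything, a check along the following lines could be used. It is only a sketch: it assumes “all 0s” means NUL (0x00) bytes rather than the ASCII character “0”, and `is_zero_filled` is a hypothetical helper name.

```python
def is_zero_filled(path, chunk_size=65536):
    """Return True if the file is non-empty and every byte is 0x00.

    Assumption: "filled with all 0s" means NUL bytes, not ASCII '0'.
    """
    seen_data = False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            seen_data = True
            # bytes.count(0) counts NUL bytes in the chunk
            if chunk.count(0) != len(chunk):
                return False
    return seen_data
```

An empty file returns False here, so a zero-length file would not be reported as the same kind of corruption.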

Ben

(imported comment written by jdonlin)

TeamBigFix,

Is there a list of errors and fixes posted anywhere?

I did a search on “service-specific error 4” which I see once or twice a week.

Thanks,

Jim

(server name edited out)

At 09:59:23 -0500 - actionsite (http://:52311/cgi-bin/bfgather.exe/actionsite)

Gather::SyncSiteByFile error merging data2 problematic C:\Program Files\BigFix Enterprise\BES Client\__BESData\actionsite\Action 172512.fxf

At 09:59:23 -0500 -

Unhandled exception during final phase of gathering 19

At 09:59:23 -0500 - actionsite (http://:52311/cgi-bin/bfgather.exe/actionsite)

FAILED to Synchronize - Site data corrupted.

At 09:59:31 -0500 -

Client shutdown (No action site)

(imported comment written by BenKus)

Hi jdonlin,

I don’t think we have ever seen that error so I had to ask one of our developers about it… here is the response I got:

"The gathering error 19 occurs when a new .fxf file appears in the gathering stream and can’t be merged into the client’s in-memory data structures. The ‘data2’ in the log indicates an exception was thrown while reading the new file by one of our merge receivers. If the client cannot open the file, or cannot read the file, you would get this behavior. That’s not an expected situation since the client just wrote the file there a moment before. Some other piece of software might have moved the file (a virus scanner might think it matches a signature or ?)

In this situation, the action site content is in an in-between state, partly with stuff from before the gather, and partly with stuff that should be part of the new version of the site. So, the client does the conservative thing, it can’t use a site in this state (it calls this a corrupt site). The site is removed and the expected recovery path is a fresh subscribe to the site. If the site is the action site, the client cannot continue so it exits. This is because there is a lot of code in the client that assumes the action site is present and accounted for. However, the next time the client starts up, it will gather the action site and SHOULD recover unless whatever mechanism that clobbered the file happens again."

So it sounds like something funny is going on with read/write timing events… after you restart the agent does this issue go away? If you can reproduce it, can you try disabling the AV scanner as a test and see if that affects the situation.

Ben

(imported comment written by jdonlin)

Ben,

Thanks for the investigation.

On 12/24, after I started the BES Client, it stopped one minute later (I saw this in the Event Viewer).

I restarted it now and it has not stopped yet. I stopped the BESClient service and re-started it and it has not stopped yet.

I do run across these #4 errors, probably one or two a week in an environment of approximately 25000 computers.

Thanks.

Jim

(imported comment written by jessewk)

Hi Jim,

Do you notice any pattern between the computers that get the #4 errors? Do they share an operating system or a relay? Is it the same machines every week?

Jesse

(imported comment written by jdonlin)

Hi Jesse,

These are random machines across the enterprise. I have not noticed any relationship but will look the next time I find one.

These all are “corporate-flavored” Windows XP/SP2.

Jim

(imported comment written by rdamours91)

I’ve been finding clients that have the service stopped after the upgrade from 7.01.376 a while back. I’ve been manually deleting the __revocations file and restarting the service to solve the problem. Is there any chance that besclientdeploy.exe can be modified to look for broken installs and/or restart clients that have been stopped or disabled?

(imported comment written by rdamours91)

I’ve found another couple of these hung clients that weren’t __revocations related. It’s the same “Unhandled exception during final phase of gathering 19” error as mentioned above. I ran the diagnostics, which I can send in if you’d like. What I found is that the C:\Program Files\BigFix Enterprise\BES Client\__BESData\actionsite\__Local\tmp directory is corrupt and unreadable. That folder is seriously gibbled: I can open it, but I can’t delete its contents after a BES Client removal. The folder looks empty and has a size of 0 KB, but Windows says it is not empty when I try to delete it.

I’ll see if a chkdsk can recover the folder so I can delete it and try the client re-install. I left the other pc as a benchmark to see if the chkdsk fixes the folder and the client restarts itself on the next boot.

(imported comment written by rdamours91)

They are back from the dead after running chkdsk from the GUI with both options selected: automatically fix file system errors, and scan for and attempt recovery of bad sectors. The clients restarted themselves and recovered after the disk errors were fixed.

The PCs are all new and use the same corporate disk image. Something corrupted this directory and made it totally unusable… Might be one that you’d want to look into :wink:

(imported comment written by BenKus)

Hey rdamours,

Theoretically nothing the BigFix Agent can do should be allowed to corrupt the filesystem like that because the agent only does normal read/write file calls and the symptoms you described seemed like a filesystem or hardware failure… It is possible that the BigFix Agent is somehow triggering this bad behavior on these systems and maybe we could make a change to workaround these issues, but I am pretty sure in the end that we would be able to point to the fact that something about the OS or the HW is not working as expected… :slight_smile:

If you send the info to support, they can look into it…

Thanks for looking into this,

Ben

(imported comment written by rdamours91)

I also doubt that it’s the client alone creating the problem. We’re a Trend Micro virus scanner shop, with the bulk of our new systems being HP. Just checking to see what circumstances I have in common with other installations. Who in particular should I send the diag file to?

(imported comment written by Fredrik23)

I just read about this, and we are having the same problem.

So far I have found more than 30 clients out of 2000 that have the revocation error. The number will probably end up at about 60-70.

This feels like it started after the update. I am now fixing all the clients and hoping they stay OK.

(imported comment written by BenKus)

Hey Fredrik,

Is the problem you are seeing with the revocations list or the corrupted tmp folder?

Thanks,

Ben

(imported comment written by Fredrik23)

Sorry for the late answer, but it was the revocation file. I deleted the file and restarted the client, and all was OK after that.

(imported comment written by cojack91)

We are running BigFix Enterprise Client 7.1.1.315 and we continually find the “BES Client” service not running, with the log file indicating that the __revocations file is corrupted. After deleting this file and restarting the service, the problem is fixed.

I personally think this is NOT acceptable at all and the client installation should have a self healing aspect incorporated into the coding of the application. What is the point of having to hunt down computers which have not reported into BigFix and then manually fix this KNOWN ISSUE!!

When you are dealing with 2000+ clients, the administration of this problem is totally unsatisfactory. Whether it is 1 client or 60 clients with this problem, it totally undermines the purpose of BigFix. I feel BigFix should be treating this as an Urgent / Critical Problem and should provide either code or an additional service to fix this problem.

We have to run a VBScript at each computer’s startup that checks for this service and, if it is not running, deletes the file and restarts the service.
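The startup check described above is a VBScript in our environment; the same logic is sketched here in Python for illustration only. It assumes the service is registered as `BESClient`, that `sc query` reports an English `STATE` line, and that the caller passes in the full path to the `__revocations` file. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def service_is_running(sc_output):
    """Parse 'sc query <service>' output; True only if STATE reports RUNNING.

    Assumption: English-language output such as 'STATE : 4  RUNNING'.
    Other states (STOPPED, START_PENDING, ...) count as not running.
    """
    for line in sc_output.splitlines():
        if line.strip().startswith("STATE"):
            return "RUNNING" in line
    return False


def query_service(service):
    """Return the text printed by 'sc query <service>' (Windows only)."""
    result = subprocess.run(["sc", "query", service],
                            capture_output=True, text=True, check=False)
    return result.stdout


def start_service(service):
    """Ask the service control manager to start the service (Windows only)."""
    subprocess.run(["sc", "start", service], check=False)


def heal_if_stopped(revocations_file, service="BESClient",
                    query=query_service, start=start_service):
    """If the service is not running, delete the file and start the service."""
    if service_is_running(query(service)):
        return "running"
    path = Path(revocations_file)
    if path.is_file():
        path.unlink()
    start(service)
    return "restarted"
```

The `query` and `start` parameters exist so the decision logic can be tried against canned `sc query` output before it is pointed at real machines.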

(imported comment written by cojack91)

Sent: 13 November 2008 13:15

Subject: RE: BigFix: BES Client won’t start

Hi Shaun,

Some information on this issue:

The issue is caused by a bug in the NTFS Caching system for Microsoft. We have an open issue with Microsoft as we work with them to resolve this issue. I believe at this point Microsoft has identified the issue and we will have a fix in our code base for our next release. As a short term solution until we release the new version of BigFix we have a task in the BES Support site for watching the BES Client service.

In the BES Console if you do the following:

  1. Go to the tasks tab

  2. Expand All Tasks

  3. Expand By Site

  4. Select BES Support

  5. Find the task “Install BES Client Helper Service”

  6. Open up the task and select the option to deploy

This client watcher will attempt to fix the revocations issue and some other common issues if the BES Client service is not running and then restart the BES Client.