Detecting corrupt or hung BESClients

(imported topic written by rmnetops91)

Does anyone have a good way to detect hung or unresponsive BESClient services on systems? We occasionally have computers with BESClient.exe processes that stop responding or stop logging to their client log. The cause can be disk corruption or some other piece of software interfering. We are looking for a way to identify these issues easily, other than manually comparing the agent’s last report time to the ping status of the machine on the network (i.e., if we can ping the system, it should have a recent last report time; if it replies but has a really old last report time, we know something is up).

It would be nice if the BES server sent an ICMP ping and displayed whether the host was “up” on the network, next to the node in the console. That way, if it has an old last report time but an “up” status from a ping, we would quickly know there is an issue with the BES client on that host.
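The manual comparison described above (stale last report time, but the host still answers ping) can be sketched as a small filter. This is only an illustration under assumptions: the function name `find_suspect_clients`, the input dictionaries, and the one-day threshold are all hypothetical, and the last report times and ping results would have to be collected by some other means.

```python
from datetime import datetime, timedelta


def find_suspect_clients(last_reports, ping_ok, now, max_age=timedelta(days=1)):
    """Return the names of clients that answer ping but have not reported recently.

    last_reports: dict mapping computer name -> datetime of last report (assumed input)
    ping_ok:      dict mapping computer name -> True if the host answered a ping
    max_age:      how old a last report may be before it counts as stale (assumed value)
    """
    suspects = []
    for name, last_report in last_reports.items():
        stale = (now - last_report) > max_age
        # A stale client that is still reachable points at the agent, not the network.
        if stale and ping_ok.get(name, False):
            suspects.append(name)
    return sorted(suspects)
```

A host that is stale and does not answer ping is left out, since it may simply be powered off.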

(imported comment written by BenKus)

Hi rmnetops,

Asset Discovery can help by scanning the network and finding computers that don’t have the agent running…

Ben

(imported comment written by rmnetops91)

Are you talking about unmanaged assets? The problem with that is it won’t show an asset as unmanaged if it’s still in the database as a BigFix client. We are concerned about machines that are still in the database as BigFix-managed clients but whose client is not responding.

(imported comment written by SystemAdmin)

It is tricky to correctly identify and remediate hung or unresponsive clients. There are several types of problems that can make a client appear to be in this state.

  1. Disk corruption: This is a frequent cause. See http://forum.bigfix.com/viewtopic.php?id=4055

a) Site corruption: Perhaps all the sites but one are working. If you can send an action from the Master Operator site but not from a particular custom site, look at the client logs for key indicators such as “Site data corrupted” or “BigFix could not verify the authenticity of the site content” or “Unhandled exception during final phase of gathering”. A custom analysis looking for such items in the BES logs (excluding the current log, which is locked by the client) can help identify sites that are damaged.

b) Dynamic data corruption: If a working file becomes corrupt, it may be necessary to delete (or move, if you can’t delete) part or all of the __BESdata folder.

c) Agent program files: Sometimes copying a file from a working client of the same build can remedy corrupt program files.

  2. Firewall or router blocking ports to either a relay or BES server.

  3. Agent’s relay is hung, corrupt, and/or blocked from upstream relays or BES server. Verify that other agents in the same sites reporting to the same relay are responding and accepting actions. If not, then you probably have a relay issue to remediate.

  4. Agent is hung in memory.

When the BESClientHelper first came out, I was hopeful that it would resolve at least the local client issues (obviously it can’t address network or relay problems). Having installed it, I find it helps in only a small percentage of issues.

Since the main issue by far that we see is various types of corruption, I focused on that. An automated daily script that is external to BigFix checks whether the BigFix agent is present in memory. If it isn’t, the script moves the __BESdata folder to a junk location, creates a log trail, and then starts the agent. The agent can then rebuild the dynamic data and report back in.

In the case where some site data is corrupt based on analysis results, but the agent is still responding at a Master Operator level, an action can be sent to the client to schedule a local task to stop the agent, move the partially corrupt __BESdata folder, create a log trail, then restart the agent.
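The "move the folder aside and leave a log trail" step common to both approaches can be sketched as follows. This is a minimal illustration and not the script described in this thread: the function name, the dated folder layout, and the log format are assumptions, and stopping and starting the agent services is deliberately left to the caller.

```python
import shutil
from datetime import datetime
from pathlib import Path


def quarantine_besdata(client_dir, junk_root, log_file, now=None):
    """Move <client_dir>/__BESData into a dated folder under junk_root and log it.

    Returns the new location, or None if there was no __BESData folder to move.
    The caller is expected to stop the agent first and start it again afterwards.
    """
    now = now or datetime.now()
    source = Path(client_dir) / "__BESData"
    if not source.is_dir():
        return None

    # One dated folder per run keeps earlier quarantined copies intact (assumed layout).
    target_dir = Path(junk_root) / now.strftime("%Y-%m-%d_%H%M%S")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "__BESData"
    shutil.move(str(source), str(target))

    # Append, so the log doubles as a history of remediations.
    with open(log_file, "a") as log:
        log.write(f"{now.isoformat()} moved {source} to {target}\n")
    return target
```

Moving rather than deleting keeps the suspect data available for later inspection.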

This approach is a bit primitive, but it does work. It takes some of the pain out of manually fixing each agent issue. An analysis of the remediation logs provides stats on which clients had problems and when.

It would be wonderful if the Helper service could automatically identify and remediate a few more of these scenarios.

I also concur with rmnetops that a last ping time in addition to the last report time would be valuable. Perhaps the server would only attempt to ping clients whose last report time is more than, say, an hour old (or some other configurable interval) and whose last reported IP address was internal and non-public.
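The selection rule suggested above (only ping clients that are stale and whose last reported address is internal) could look roughly like this. It is a sketch under assumptions: `should_ping` and its parameters are hypothetical names, and Python's `is_private` check is used as a stand-in for "internal non-public", which is broader than the RFC 1918 ranges alone.

```python
import ipaddress
from datetime import datetime, timedelta


def should_ping(last_report, last_ip, now, threshold=timedelta(hours=1)):
    """Decide whether a client is worth pinging.

    last_report: datetime of the client's last report
    last_ip:     last reported IP address as a string
    threshold:   configurable staleness limit (one hour here, as in the suggestion)
    """
    try:
        address = ipaddress.ip_address(last_ip)
    except ValueError:
        # An unparseable address cannot be classified, so skip it.
        return False
    is_stale = (now - last_report) > threshold
    return is_stale and address.is_private
```

Skipping public addresses avoids pinging machines that last reported from outside the corporate network.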

(imported comment written by rmnetops91)

Are you saying you have created a script that does this?

(imported comment written by SystemAdmin)

I do use several scripts to address or prevent unresponsive agents in certain (but not all) scenarios. The BESClientHelper is supposed to help, but it isn’t effective in several scenarios, corruption in particular.

Note: The following scripts were created and tested on Windows clients only.

Scenario: Agent is reporting and responds to Master Operator actions, but not custom site actions.

An analysis can assist in identifying such agents. Unfortunately, we can’t parse the active client log (which is rather ironic), so we have to settle for previous days’ logs.

exists lines containing "Site data corrupted" of it of files whose (creation time of it < (now - time interval "1 day")) of folder (if (exists folder "C:\Program Files\BigFix Enterprise\BES Client") then ("C:\Program Files\BigFix Enterprise\BES Client\__BESData\__Global\Logs") else (if (exists folder "C:\Program Files (x86)\BigFix Enterprise\BES Client") then ("C:\Program Files (x86)\BigFix Enterprise\BES Client\__BESData\__Global\Logs") else ("C:\Program Files\BigFix Enterprise\Enterprise Client\__BESData\__Global\Logs")))

Similar properties can also be created for “BigFix could not verify the authenticity of the site content” or others.
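For checking a single machine outside the console, the same idea (search older client logs for the known indicator strings and skip the current log) can be sketched in Python. The indicator strings are the ones quoted earlier in this thread; the function name and the assumption that the current log is named `YYYYMMDD.log` are mine and should be verified against a real client.

```python
from datetime import date
from pathlib import Path

# Indicator strings quoted earlier in this thread.
INDICATORS = (
    "Site data corrupted",
    "BigFix could not verify the authenticity of the site content",
    "Unhandled exception during final phase of gathering",
)


def scan_logs(log_dir, today=None):
    """Return {log file name: [indicators found]} for every log except today's."""
    today = today or date.today()
    # Assumption: the active log is named after the current date, e.g. 20120103.log.
    current_name = today.strftime("%Y%m%d") + ".log"
    hits = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        if path.name == current_name:
            continue  # the active log is locked by the client
        text = path.read_text(errors="ignore")
        found = [marker for marker in INDICATORS if marker in text]
        if found:
            hits[path.name] = found
    return hits
```

An empty result means none of the indicators appeared in the older logs; it does not prove the client is healthy.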

In this case, we take the following action as a Master Operator to clear the __BESdata:

if {not exists folder "c:\bad"}

dos md c:\bad

endif

if {not exists folder "c:\temp"}

dos md c:\temp

endif

delete c:\BESData.log

delete c:\temp\cleanup.bat

delete __appendfile

appendfile @echo off

appendfile md "c:\bad\{current date}"

appendfile net stop healthservice

appendfile net stop besclienthelper

appendfile net stop besclient

appendfile move /Y "{pathname of parent folder of client & "\__BESData"}" "c:\bad\{current date}"

appendfile echo %date% %time% > c:\BESData.log

appendfile net start besclient

appendfile net start besclienthelper

appendfile net start healthservice

appendfile exit

copy __appendfile c:\temp\cleanup.bat

dos echo %time% > c:\temp\time.txt

dos at {((preceding text of last ":" of line 1 of it) of file "c:\temp\time.txt" as time_of_day) + time interval "00:02:00"} c:\temp\cleanup.bat /interactive

Scenario: Agent reports, but is unresponsive to MO or custom site actions. Execute the cleanup script above via some other means (interactively, SCOM, etc.).

Scenario: Daily scheduled check to make sure agent is running.

I’ve noticed there are situations in which portions of the agent are corrupt and the BESClientHelper is unable to restart the agent. The scripts below are a daily automated attempt to deal with disk corruption that may have occurred in the past day and left the agent unable to start. The check is scheduled as a local task in the Windows scheduler.

if {not exists folder "c:\windows\sct"}

dos md c:\windows\sct

endif

if {not exists folder "c:\bad"}

dos md c:\bad

endif

if {not exists folder "c:\temp"}

dos md c:\temp

endif

delete c:\windows\sct\BES_clear_sch.bat

delete __appendfile

appendfile @echo off

appendfile for /f %%i in ('at ^| find "1 Each M T W Th F S Su 7:00 AM c:\windows\sct\BEScheck.bat"') do at 1 /delete

appendfile for /f %%i in ('at ^| find "2 Each M T W Th F S Su 7:00 AM c:\windows\sct\BEScheck.bat"') do at 2 /delete

appendfile for /f %%i in ('at ^| find "3 Each M T W Th F S Su 7:00 AM c:\windows\sct\BEScheck.bat"') do at 3 /delete

appendfile for /f %%i in ('at ^| find "4 Each M T W Th F S Su 7:00 AM c:\windows\sct\BEScheck.bat"') do at 4 /delete

appendfile for /f %%i in ('at ^| find "5 Each M T W Th F S Su 7:00 AM c:\windows\sct\BEScheck.bat"') do at 5 /delete

appendfile for /f %%i in ('at ^| find "6 Each M T W Th F S Su 7:00 AM c:\windows\sct\BEScheck.bat"') do at 6 /delete

appendfile for /f %%i in ('at ^| find "7 Each M T W Th F S Su 7:00 AM c:\windows\sct\BEScheck.bat"') do at 7 /delete

copy __appendfile c:\windows\sct\BES_clear_sch.bat

waithidden c:\windows\sct\BES_clear_sch.bat

delete c:\windows\sct\BEScheck.bat

delete __appendfile

appendfile @echo off

appendfile c:

appendfile cd\

appendfile cd windows\sct

appendfile del /q check.log

appendfile for /f %%i in ('tasklist ^| find "BESClient.exe"') do echo %%i > check.log

appendfile if exist check.log goto END

appendfile start /min BEScleanup.bat

appendfile :END

appendfile exit

copy __appendfile c:\windows\sct\BEScheck.bat

delete c:\windows\sct\BEScleanup.bat

delete __appendfile

appendfile @echo off

appendfile set bad=%date%

appendfile md "c:\bad"

appendfile md "c:\bad\%bad%"

appendfile net stop besclienthelper

appendfile move /Y "{pathname of parent folder of client & "\__BESData"}" "c:\bad\%bad%"

appendfile echo %date% %time% >> c:\windows\sct\BEScheck.log

appendfile net start besclient

appendfile exit

copy __appendfile c:\windows\sct\BEScleanup.bat

dos at 07:00:00 /every:monday,tuesday,wednesday,thursday,friday,saturday,sunday c:\windows\sct\BEScheck.bat /interactive

Notes:

  • These scripts may be a bit primitive, but they are effective. They automatically catch and remediate quite a few instances of agent corruption without any intervention.
  • A simple analysis of BEScheck.log and BESData.log will tell you when and how often auto-remediation is happening.
  • Why the old-school use of the “at” command? For the /interactive switch, which seems to be essential in making this work.
  • The “Daily Check” is effective when run as a policy action across the enterprise. Once it is scheduled in the local Task Scheduler, it will automatically deal with corrupt dynamic data (__BESdata) that keeps the agent from starting.
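As an illustration of the log analysis mentioned in the notes, the sketch below counts remediation entries per day from lines written by `echo %date% %time%`. It assumes each line ends with a single time token and that everything before it is the date (for example `Tue 01/03/2012  7:00:01.23`); the actual format depends on the machine's regional settings, so treat the parsing as an assumption. The function name is hypothetical.

```python
from collections import Counter


def remediations_per_day(lines):
    """Count log entries per date string.

    Each line is assumed to look like '<date tokens> <time>', as produced by
    'echo %date% %time%'. Blank or single-token lines are ignored.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        day = " ".join(parts[:-1])  # drop the trailing time token
        counts[day] += 1
    return dict(counts)
```

Only BEScheck.log accumulates history, since it is written with `>>`; the first script overwrites c:\BESData.log on each run.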

(imported comment written by rdamours91)

Nice work on the scripts. I started something similar and got sidetracked with other work.

I put in the detection Relevance, moved some of the logs into the temp folder, and it works perfectly. I picked up 11 workstations out of 25,000 last night and will see how we do today. I remoted in to one of them, and the agent has repaired itself and is getting caught up.

I’ll use the cleanup script on other machines with suspected corruption that isn’t site-specific, so to speak.

Another example of the power of the forum :)

(imported comment written by rmnetops91)

Would be nice if IBM can add some of this functionality natively into the Agent or the Helper service… hint hint

(imported comment written by SystemAdmin)

rmnetops wrote: “Would be nice if IBM can add some of this functionality natively into the Agent or the Helper service… hint hint”

+1

(imported comment written by gjeremia91)

I’ve developed similar Fixlets, and I too believe that the client/relay could do a better job of self-healing and self-remediation.

The most recent one detects when a relay has become corrupt in a way that prevents it from receiving new instructions (such as instructions to fix itself). In this situation, we want to tell the local client not to use the local relay and instead to find a new relay. There is no easy setting available to handle this yet (enhancement request), but by terminating the relay and restarting the client, we can get new instructions to the client that can rectify the problem (typically by deleting problematic files/sites, which can then be resynchronized).

I’d be happy to share.

In my experience, the data corruption behind situations like this and the others described does not occur without some end user/admin involvement (aka PEBKAC). We discovered that many instances happen when a user goes snooping in the TEM/BigFix directories and inadvertently locks a file or folder. As a result, I often mark the BigFix installation directory as System and Hidden (on Windows) with a task. This doesn’t prevent the problem, but I feel it helps.