Baseline interrupted by a restart does not retry

(imported topic written by MattBoyd)

Since upgrading our Win7 32-bit clients to 8.0.627.0, we’ve started noticing that they sometimes fail to retry/resume a baseline if they were interrupted while running it. We have several baselines with settings as follows:

This action will never expire.

It will run at any time of day, on any day of the week.

If the action becomes relevant after it has successfully executed, the action will be reapplied as a policy an unlimited number of times.

If the action fails, it will be retried up to 3 times, waiting 1 hour between attempts.

If a member action fails, the action group will continue to run.

As an example, we have one workstation that was running two baselines at the same time (according to the logs). Baseline A triggered a restart while Baseline B was still running. After the restart, Baseline B reported its state as Failed, but never retried despite the settings listed above. The Baseline B action member that was running at the time of the restart also reported failed, but the remaining action members in the baseline continued to show “Waiting” and never ran. Other actions that were not part of Baseline B continued to run successfully.

Has anyone else run into this? So far, the only way I’ve found to recover is to delete the site folder that the action belongs to under C:\Program Files\BigFix Enterprise\BES Client\__BESData and then restart the BES client.

We’re going to reconfigure our baselines to prevent the Baseline A restart from happening while Baseline B is running. However, the client ought to recover more gracefully when a restart or any other interruption occurs.

(imported comment written by BenKus)

Hey boyd,

Have you noticed this several times?

The agent does a fair amount of work to serialize its state to disk before a restart so that it can continue gracefully afterwards… However, over the years we have had a few bugs in this area… I don’t believe we know of any open bugs at the moment, but it is always tricky when a computer needs to restart and processes only have a short time to exit.

It sounds like in this case your agent had some of its state corrupted and it prevented it from properly running this action. If you happen to see a reproducible case, we would be happy to test it out (I tried a basic test based on what you said and I didn’t see the same problem).

Ben

(imported comment written by MattBoyd)

Thanks Ben, I appreciate you giving it a try. I will try to find other machines that have a similar issue. I’ve seen at least two clients so far that are “stuck” on a baseline, but that doesn’t mean that it’s the same for both.

If/when I come across it again and figure out how to consistently reproduce it, what information should I provide to support? Should I run that diagnostic utility and provide that?

(imported comment written by SystemAdmin)

Boyd, there are definitely some idiosyncrasies (bugs???) about the way v8.0 handles restarts within baselines. It seems that in v8.0, unlike 7.x, if a mid-baseline reboot is called by anything other than the restart action script command, the agent interprets that as a failure and does not complete the balance of the items.

See this thread for more detail … http://forum.bigfix.com/viewtopic.php?id=6386

Is this similar to what you are seeing?

(imported comment written by MattBoyd)

JonL, I definitely think we’re both seeing the same issue. I do see the “aborted” messages in the logs when this occurs. However, what I’m seeing is that the status of the aborted action becomes “Failed” and the remaining actions that are relevant continue to say “Waiting”. As I understand it, your remaining actions in the baseline say “Not Relevant” … correct? Our baselines are supposed to keep going if one of the actions fails. Also, they’re supposed to retry if the baseline fails.

It seems like the client isn’t checking if the baseline (or multiple action group… I use both terms interchangeably) was interrupted midway through, and it just gets stuck there. Do you have a support case open? I think this is a serious issue with the client. We depend very much on the baselines to reliably retry/continue when they fail.

PS - I thought I remembered someone else mentioning this issue, but I couldn’t find the post! Thanks for chiming in!

(imported comment written by MattBoyd)

Here’s a quick baseline that I created and reproduced the issue with. It contains a series of actions that each run for 5 minutes. To reproduce the issue, initiate a restart from the command prompt while the baseline is being executed by the client: shutdown -r -f -t 0

Here’s the result that I got:

Summary

The action failed.

This action has been applied 1 time.

Status: Failed

Start Time: 2/10/2011 2:16:53 PM

End Time: 2/10/2011 2:29:59 PM

Exit Code: None

Sub-action Status

Completed - Running Action Test

Completed - Running Action Test

Failed - Running Action Test

Evaluating - Running Action Test

Evaluating - Running Action Test

Evaluating - Running Action Test

Evaluating - Running Action Test

I’m not sure why the remaining actions say “Evaluating” instead of “Waiting” this time… and the action hasn’t been retried since failing, but is supposed to retry every 10 minutes up to 10 times. I’ll keep an eye on it to see if it ever ends up retrying, but I doubt it will.
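To make "supposed to retry every 10 minutes up to 10 times" concrete, here is a rough Python sketch of when the retries would have been due at the earliest. It assumes the wait is counted from the End Time shown in the summary and that every retry fails immediately; neither assumption is confirmed anywhere in this thread, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta

def expected_retry_times(failed_at, interval, max_retries):
    """Earliest time each retry could start.

    Assumption (not confirmed by the client's behavior): the wait is counted
    from the failure time and each retry fails instantly, so retry n is due
    n intervals after the first failure.
    """
    return [failed_at + interval * n for n in range(1, max_retries + 1)]

# End Time from the summary above: 2/10/2011 2:29:59 PM
failed_at = datetime(2011, 2, 10, 14, 29, 59)
retries = expected_retry_times(failed_at, timedelta(minutes=10), 10)

for n, when in enumerate(retries, start=1):
    print(f"retry {n:2d} due no earlier than {when:%H:%M:%S}")
```

Under those assumptions, a client log that shows no new "starting group action" entry well after the first of these times suggests the retry is not being scheduled at all.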

I reproduced the issue on a Win7 32-bit client with BES 8.0.627.0

Log extract:

At 14:16:53 -0500 - opsite# (http://<server>:52311/cgi-bin/bfgather.exe/opsite#)
   Relevant -  (fixlet:167875)
   Relevant - Running Action Test (fixlet:167876)
   Relevant - Running Action Test (fixlet:167877)
   Relevant - Running Action Test (fixlet:167878)
   Relevant - Running Action Test (fixlet:167879)
   Relevant - Running Action Test (fixlet:167880)
   Relevant - Running Action Test (fixlet:167881)
   Relevant - Running Action Test (fixlet:167882)
   Relevant - Baseline "stuck" Test (fixlet:167866)
At 14:16:53 -0500 -
   ActionLogMessage: (action 167875 ) Action signature verified
   ActionLogMessage: (action 167875 ) starting group action
   ActionLogMessage: (action 167876 ) starting sub action
At 14:16:54 -0500 - actionsite (http://<server>:52311/cgi-bin/bfgather.exe/actionsite)
   Command succeeded createfile until ENDOFFILE (fixlet 167876)
   Command succeeded delete No 'sleep5.vbs' exists to delete, no failure reported (fixlet 167876)
   Command succeeded copy __createfile sleep5.vbs (fixlet 167876)
At 14:16:56 -0500 -
   Report posted successfully.
At 14:17:51 -0500 -
   Report posted successfully.
At 14:19:39 -0500 -
   GatherHashMV command received. No matching site.
At 14:20:42 -0500 -
   DownloadPing command received (ID=167884)
At 14:21:10 -0500 -
   Report posted successfully.
At 14:21:55 -0500 - actionsite (http://<server>:52311/cgi-bin/bfgather.exe/actionsite)
   Command succeeded (Exit Code=0) waithidden cscript.exe sleep5.vbs (fixlet 167876)
At 14:21:56 -0500 -
   ActionLogMessage: (action 167876 ) ending sub action
At 14:21:56 -0500 - opsite# (http://<server>:52311/cgi-bin/bfgather.exe/opsite#)
   Not Relevant - Running Action Test (fixlet:167876)
At 14:21:58 -0500 -
   ActionLogMessage: (action 144410 ) Action signature verified
   ActionLogMessage: (action 167877 ) starting sub action
At 14:21:58 -0500 - actionsite (http://<server>:52311/cgi-bin/bfgather.exe/actionsite)
   Command succeeded createfile until ENDOFFILE (fixlet 167877)
   Command succeeded delete sleep5.vbs (fixlet 167877)
   Command succeeded copy __createfile sleep5.vbs (fixlet 167877)
At 14:22:04 -0500 -
   Report posted successfully.
At 14:23:56 -0500 -
   GatherHash command received. No matching site.
At 14:24:37 -0500 -
   Report posted successfully.
At 14:24:54 -0500 -
   GatherHash command received. No matching site.
At 14:27:00 -0500 - actionsite (http://<server>:52311/cgi-bin/bfgather.exe/actionsite)
   Command succeeded (Exit Code=0) waithidden cscript.exe sleep5.vbs (fixlet 167877)
At 14:27:01 -0500 -
   ActionLogMessage: (action 167877 ) ending sub action
At 14:27:01 -0500 - opsite# (http://<server>:52311/cgi-bin/bfgather.exe/opsite#)
   Not Relevant - Running Action Test (fixlet:167877)
At 14:27:04 -0500 -
   ActionLogMessage: (action 144410 ) Action signature verified
   ActionLogMessage: (action 167878 ) starting sub action
At 14:27:04 -0500 - actionsite (http://<server>:52311/cgi-bin/bfgather.exe/actionsite)
   Command succeeded createfile until ENDOFFILE (fixlet 167878)
   Command succeeded delete sleep5.vbs (fixlet 167878)
   Command succeeded copy __createfile sleep5.vbs (fixlet 167878)
At 14:27:11 -0500 -
   Report posted successfully.
At 14:29:57 -0500 -
   ShutdownListener
At 14:29:59 -0500 -
   Power History: Failed to process system and monitor events - Windows Error: The data is invalid.
   ActionLogMessage: (action 167875 ) ending group action (aborted)
At 14:29:59 -0500 - actionsite (http://<server>:52311/cgi-bin/bfgather.exe/actionsite)
   Command failed (Action aborted before completion) waithidden cscript.exe sleep5.vbs (fixlet 167878)
At 14:29:59 -0500 -
   ActionLogMessage: (action 167878 ) ending sub action
   ActionLogMessage: (action 167875 ) ending group action (aborted)
At 14:30:00 -0500 -
   Client shutdown (Service manager shutdown request)
At 14:32:19 -0500 -
   Starting client version 8.0.627.0
   FIPS mode disabled by default.
At 14:32:29 -0500 -
   Restricted mode
At 14:32:36 -0500 -
   RegisterOnce: Attempting to register with 'http://<relay>:52311/cgi-bin/bfenterprise/clientregister.exe?RequestType=RegisterMe60&ClientVersion=8.0.627.0&Body=322341&SequenceNumber=158&MinRelayVersion=6.0.0.0&CanHandleMVPings=1&Root=http://<server>%3a52311&AdapterInfo=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
   Unrestricted mode
   Configuring listener as wake-on-lan forwarder(AdapterInfo=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx).
   Registered
At 14:32:47 -0500 - actionsite (http://<server>:52311/cgi-bin/bfgather.exe/actionsite)
   Successful Synchronization with FixSite (version 303759) - 'http://<relay>:52311/cgi-bin/bfenterprise/BESGatherMirror.exe?url=http://<server>:52311/cgi-bin/bfgather.exe/actionsite'
At 14:32:47 -0500 -
   SetupListener success: IPV4/6
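For anyone who wants to check other clients for the same pattern, here is a minimal Python sketch that scans client log text for the two telltale entries in the extract above: a group action ending as "(aborted)" and a command that failed with "Action aborted before completion". The regular expressions are modeled only on the lines in this extract, so other client versions may word these entries differently; the function names and the `sample` string are made up for illustration.

```python
import re

# Patterns modeled on the entries in the log extract above (assumption:
# other client versions may format these messages differently).
ABORTED_GROUP = re.compile(
    r"ActionLogMessage: \(action (\d+)\s*\) ending group action \(aborted\)"
)
ABORTED_COMMAND = re.compile(
    r"Command failed \(Action aborted before completion\) (.*?) \(fixlet (\d+)\)"
)

def find_aborted_groups(log_text):
    """Return the sorted, de-duplicated IDs of group actions logged as aborted."""
    return sorted({int(m.group(1)) for m in ABORTED_GROUP.finditer(log_text)})

def find_aborted_commands(log_text):
    """Return (sub-action ID, command) pairs for commands cut off mid-run."""
    return [(int(m.group(2)), m.group(1)) for m in ABORTED_COMMAND.finditer(log_text)]

# A few entries copied from the extract, run together the way they were posted.
sample = (
    "At 14:29:59 -0500 - Power History: Failed to process system and monitor events - "
    "Windows Error: The data is invalid. "
    "ActionLogMessage: (action 167875 ) ending group action (aborted) "
    "At 14:29:59 -0500 - actionsite (http://<server>:52311/cgi-bin/bfgather.exe/actionsite) "
    "Command failed (Action aborted before completion) waithidden cscript.exe sleep5.vbs (fixlet 167878) "
    "At 14:29:59 -0500 - ActionLogMessage: (action 167878 ) ending sub action "
    "ActionLogMessage: (action 167875 ) ending group action (aborted)"
)

print(find_aborted_groups(sample))
print(find_aborted_commands(sample))
```

The patterns do not depend on line breaks, so they work whether the log is read from the file or pasted as one run of text.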

(imported comment written by SystemAdmin)

Boyd, you are seeing the same issue we are. It is easily reproducible (in v8.0) by restarting during a baseline by any means other than the actionscript command “restart”.

If you replace any restarts with the actionscript “restart” command, do your issues go away? They did for me. For example, I had to set MSI installers not to reboot automatically, and then follow them with “action requires restart” and “restart”.

Bigfix … errrr … TEM, is this on a bug list yet?

(imported comment written by MattBoyd)

Hey JonL,

I haven’t tried the action script restart command but, quite frankly, that’s not the answer. What if there’s a power failure while the client is running a baseline? It just won’t ever finish that baseline? That’s not very “robust” IMO.

(imported comment written by MattBoyd)

FYI, I just sent an e-mail to support about this.

(imported comment written by SystemAdmin)

I discussed this issue with support (Jan 3-5). They weren’t previously aware of the issue. After struggling with it for a few days, I found the work-around using the actionscript “restart”, which I shared with them. They didn’t seem especially concerned about it, but I consider it a bug. It is a legitimate concern, to your point, that should be addressed. Clients that are either accidentally or unwittingly rebooted mid-baseline will not receive the balance of the items in the baseline. Please share your results. Hopefully you get further than I got.

(imported comment written by BenKus)

Hey guys,

We had a brief internal discussion about this…

Couple notes:

  • I think the change in behavior was a result of another change where we were trying to prevent an issue where a broken action could spiral out of control by constantly re-running.
  • I think the change in behavior is probably a bug because it clearly is a behavior that people rely on…
  • We need to do an internal analysis to make sure we get this right with all factors involved.
  • As a general note, it is a good idea to use the agent’s built-in restart command instead of other methods. I would call this “best practice”.

Hopefully we will have an update soon. Aaron Bauer, our usability architect, is reviewing a lot of the details for you guys.

Ben

(imported comment written by MattBoyd)

Ben,

Thanks for the follow-up. I agree that using the built-in restart command is a good practice, but not always possible. FWIW, we rarely issue reboots in the middle of a baseline, and I first noticed the issue when a post-action reboot of a baseline caused another baseline that was running at the same time to stop working.

A big thanks to you and abauer for looking into this. Please let us know what you come up with.

(imported comment written by SystemAdmin)

Thanks for the follow-up Ben. We do have several baselines that have 1 to 3 reboots within them where the reboot is an integral part of the process flow. While I’ve worked around the issue by using the restart command, it would be great to have this resolved.

(imported comment written by SystemAdmin)

Hi guys,

A clarification and a question to make sure I’m understanding correctly.

First, what you’re seeing is that if the machine is restarted by something other than the built-in restart command, the action never retries even if you included retrying in the execution parameters. This is the new behavior, NOT the fact that the action returns failed for the first try.

Second, could you give examples of when using the built-in restart command is not possible? This would help our internal discussions.

(imported comment written by MattBoyd)

abauer

First, what you’re seeing is that if the machine is restarted by something other than the built-in restart command, the action never retries even if you included retrying in the execution parameters. This is the new behavior, NOT the fact that the action returns failed for the first try.

Correct. The action does not retry, even if retrying is included in the execution parameters.

abauer

Second, could you give examples of when using the built-in restart command is not possible? This would help our internal discussions.

-Power outages

-Restarts issued by external sources, such as the Windows Update service

-Restarts initiated by end users

-Post-action restarts (specified in the Post-Action tab) initiated by other actions

So far, I’ve been able to identify stuck Multiple Action Groups (or baselines) by the following characteristics:

-The status is set to Failed

-The End Time is set

-The sub-action that was running is set to Failed, and the remaining sub-actions are stuck at “Waiting” or “Evaluating”. If the sub-actions say “Waiting”, their detailed status is “Waiting on action dependency”, as if the action group were still running and each sub-action were still waiting for its turn.
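The characteristics above can be written down as a simple check. This is only a sketch in Python: the `GroupStatus` record and `looks_stuck` function are hypothetical, and how the status values get out of the console or database is left open, since nothing in this thread covers that.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GroupStatus:
    """Hypothetical record of what the console shows for one action group."""
    status: str                      # e.g. "Failed"
    end_time: Optional[str]          # None or "" when no End Time is set
    sub_statuses: List[str] = field(default_factory=list)

# Sub-action states that should not persist once the group has ended.
PENDING = {"Waiting", "Evaluating"}

def looks_stuck(group: GroupStatus) -> bool:
    """True when the group matches all three characteristics listed above."""
    if group.status != "Failed" or not group.end_time:
        return False
    has_failed = "Failed" in group.sub_statuses
    has_pending = any(s in PENDING for s in group.sub_statuses)
    return has_failed and has_pending

# Values taken from the test baseline summary posted earlier in this thread.
test_baseline = GroupStatus(
    status="Failed",
    end_time="2/10/2011 2:29:59 PM",
    sub_statuses=["Completed", "Completed", "Failed",
                  "Evaluating", "Evaluating", "Evaluating", "Evaluating"],
)
print(looks_stuck(test_baseline))
```

A group that failed and has no pending sub-actions is not flagged, since that is an ordinary failure rather than the stuck state described here.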

(imported comment written by SystemAdmin)

abauer, I agree with the items boyd mentioned. Just had a few additional notes.

This behavior is new in 8.0. In 7.x and earlier, a baseline would continue after ANY restart of the system.

See this post for more detailed examples: http://forum.bigfix.com/viewtopic.php?id=6386

A more standardized, integrated way to handle planned mid-baseline restarts would be helpful, especially for console operators unfamiliar with the caveats of doing it successfully.

Perhaps even offer a choice of behaviors as part of the baseline custom action settings. Those could include options for both inadvertent and planned reboots. Examples: Automatically continue or not. Continue only if a relevance condition is true. Retry or not (and number of times). Is user login required before continuing? Administrative login? Handling for UAC scenarios? Conditions to wait for before proceeding (service DB unlocked, particular services started, processes in memory, time period elapsed, etc.).

While I know how to code these items into a baseline so that a planned reboot executes successfully, many console operators do not. It would be much more convenient for advanced users and terrifically helpful for more novice users to have some integrated options.

(imported comment written by MattBoyd)

So… any chance there will be a fix for this soon? It’s interrupting our build process and preventing workstations from receiving all of their applications.

(imported comment written by MattBoyd)

I really wish this had been addressed in 8.1.608! We have to babysit our baselines now. Why has this critical behavior, which worked fine in 7.2.x and is so broken in 8.x, not been fixed yet?