Killing unresponsive besclient process

In reference to an old post: How do I kill an unresponsive action

On Windows this is straightforward: if the process is found, kill it after a certain time. On Linux/Unix, however, when we run a script, BESClient is just a carrier, and the zombie processes are generated under BESClient itself.

How do we deal with this situation? The script has a cutoff to kill itself, but for some reason BESClient still doesn’t stop and keeps running for N hours. How can we put something in place to kill the BESClient zombie process?

Assuming a current client version, check _BESClient_ActionManager_OverrideTimeoutSeconds at https://help.hcltechsw.com/bigfix/9.5/platform/Platform/Config/r_client_set.html

Thanks @JasonWalker, it’s something new, but we can’t set it on all machines. Multiple scripts may be executed on multiple servers, and each script’s execution time is different, so binding clients to one specific time frame through a client setting can create issues.

I guess putting something in the action script for that specific action/script would be much more useful than this client setting.
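
One way to scope a limit to a single script rather than to the whole client, sketched in bash. This assumes the `timeout` utility (GNU coreutils or BusyBox) is available on the endpoint; `run_limited` is a hypothetical helper name, and whether this also clears the process seen under BESClient has not been verified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one command with its own hard deadline.
# Usage: run_limited <seconds> <command> [args...]
run_limited() {
    local limit=$1; shift
    # Send SIGTERM at the deadline, then SIGKILL 5 seconds later if still alive.
    timeout -s TERM -k 5 "$limit" "$@"
    local rc=$?
    # timeout exits with 124 when the deadline was hit.
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

Each script can then get its own limit, for example `run_limited 40 ./script1.sh` followed by `run_limited 10 ./script2.sh` (script names are placeholders).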

Ok, then, if it’s for a custom fixlet…by far your best option is to build content that doesn’t hang or prompt for user input :slight_smile:
But to handle the condition gracefully, have a look at https://developer.bigfix.com/action-script/reference/execution/override.html and the timeout_seconds parameter.

Thanks @JasonWalker! Regarding that client setting, if you have used it, how does it react to patching jobs?

I tested the wait override as below. The action is stopped, but the cmd is still running. Is this the wrong process, or am I missing something?

After 5 seconds, the action and any running child process should be terminated, right?

Command succeeded override wait
Command succeeded override timeout_seconds=5
Command succeeded override disposition=terminate
Command started - wait cmd.exe /C ping 192.168.1.5 -t
Command failed (Thread execution failed due to timeout, process was terminated.) wait cmd.exe /C ping 192.168.1.5 -t

You should set the timeout for ALL machines to some large value, like 2 hours, as a failsafe. It is true you wouldn’t want to set the timeout for ALL machines to some small value, but pick what you think should be safe enough for all cases and set it to that. At least set it to SOMETHING.

@jgstew thanks, I have taken that into consideration.

Can you please also take a look at why this timeout is not killing the running CMD after 5 seconds?

It will only kill the cmd.exe launched by the action. Are you certain that the cmd.exe you are seeing is not another instance of cmd? Use Process Monitor (from www.microsoft.com/sysinternals ) to display the parent process of cmd.exe.

No @JasonWalker, it’s not closing the specific CMD that was launched by this action. There is no other CMD pinging that test IP.

By the way, I am just testing it with the Fixlet Debugger.

Oh I haven’t actually tried that in the Debugger. Can you try in a real test action?

I would NOT assume that the fixlet debugger works properly with all override commands and other nuances. It is really for testing the basics. The client execution environment will always be a bit different.

Something I wanted to clarify about this:

This timeout should not depend on how long the action takes to execute, but how long a specific wait or run command takes to execute. You could have a baseline or an action that takes 5 hours to run for some unknown reason, but as long as no single command takes more than 2 hours to run within that execution, then the timeout should NOT be triggered.

This is again why I emphasize, that the timeout should always be set to something, it is just a matter of how long. 1 hour, 2 hours, 6 hours, SOMETHING.

Generally if you are going to run a background long running process with BigFix, then it is best to trigger it with a “run” command and let BigFix move on to other actions. (you have to be careful with this kind of thing, not run it in the __Download folder, etc…) So even in cases where you might run a background AV scan, I would highly recommend kicking it off with bigfix, then gather the results later with a separate action, or just analysis.

How do I use override wait with multiple script executions within the same action? Each script has its own process to execute, and I want to set a different override wait for each of them. I can’t separate them or bind them into multiple tasks.

You have to use override before every invocation of wait or run and set the timeout each time.

Your deleted post seems correct.

hahhaha I thought I might be doing it the wrong way, that’s why :slight_smile:

Below is the flow I tried, but it gets cut off at the script1 execution and does not proceed to script 2 and 3.

//script1
override wait
timeout_seconds=40
disposition=terminate
wait cmd.exe /C ping 192.168.1.1 -t

//script2
override wait
timeout_seconds=10
disposition=terminate
wait cmd.exe /C ping 192.168.1.5 -t

//script3
override wait
timeout_seconds=60
disposition=terminate
wait cmd.exe /C ping 192.168.1.8 -t

Line 165:    Command succeeded override wait (action:2945)
Line 166:    Command succeeded override timeout_seconds=40 (action:2945)
Line 167:    Command succeeded override disposition=terminate (action:2945)
Line 168:    Command started - wait cmd.exe /C ping 192.168.1.1 -t (action:2945)
Line 176:    Command failed (Thread execution failed due to timeout, process was terminated.) wait cmd.exe /C ping 192.168.1.1 -t (action:2945)
Line 178:    ActionLogMessage: (action:2945) ending action
Line 180:    Not Relevant - Custom Action (fixlet:2945)

It should not proceed. The client assumes that a hung process that hits the timeout is a hard failure.

If you want the commands to run fully independently, they should be broken up into separate actions. If the commands are NOT independent, then the hard failure on timeout is CORRECT, and you should instead use separate actions with relevance that detects whether the previous step completed successfully. There is an option to let the process continue running at timeout instead of terminating it, but then you end up with orphaned processes running forever that you have to clean up. You could clean them up manually later in the same action, after the timeout, but that is messy.

In general, bigfix is best if you can break up things into as many individual fixlets/tasks as possible, with relevance to detect success or failure of each step independently. It requires writing more relevance, but it ends up giving much better feedback over time of the actual state of things, especially if you want to FORCE a configuration, but the configuration actually has many sub parts.

Trying to think of a good published example of this.

Just to add to that, if you have a condition you expect could sometimes hang and want to handle it yourself, you could use ‘run’ instead of ‘wait’. Something like

parameter "waittime"="{now}"
run c:\someprocess.exe
pause while {exists running process "someprocess.exe" AND (now - parameter "waittime" of action as time < 10 * minute)}
if {exists running process "someprocess.exe"}
waithidden taskkill.exe /im someprocess.exe
//Do other stuff
endif

The execution timeout is to give the client a way to recover if a wait process never terminates. Otherwise the client would stay stuck and not process other actions.

For Windows this will work, but I am running multiple shell scripts. In that case BESClient is just the carrier and the zombie process appears under the BESClient name. How do I deal with that?

Use the equivalent in bash (pkill? instead of taskkill), or this:

I am not certain if this would work, but you could capture the current value of _BESClient_ActionManager_OverrideTimeoutSeconds at the top of the action, or a default value of 2 hours if unset, then set it to something lower, like 240 seconds after that, then set it back to the previous value or 2 hours at the end.

The only issue is that the client might not pick up the change immediately, so I don’t know how reliable that would be. Plus, if you set it too low and it triggers, it would not get set back to the higher value at the end because the action stops, which would be a pain.
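
A rough bash counterpart of the run / pause while / taskkill pattern from the earlier action script, as a sketch only. The name `run_with_deadline` is hypothetical, and the sketch kills by PID instead of using pkill by name so that unrelated processes with the same name are left alone.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start a command, poll it, and kill it past a deadline.
# Usage: run_with_deadline <seconds> <command> [args...]
run_with_deadline() {
    local deadline=$1; shift
    "$@" &                         # start the command in the background
    local pid=$!
    local waited=0
    # Poll once per second while the process is alive and time remains.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$deadline" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -TERM "$pid" 2>/dev/null    # ask the process to stop
        sleep 1
        kill -KILL "$pid" 2>/dev/null    # force it if it ignored SIGTERM
        wait "$pid" 2>/dev/null          # reap it so no zombie is left behind
        return 124                       # same convention as the timeout utility
    fi
    wait "$pid"                          # pass through the command's exit status
}
```

The final `wait` matters for the zombie question: a finished child stays in the process table as defunct until its parent collects its exit status.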


There was another forum thread about all this, from before this new setting was available, which was all about killing all child processes of the BESClient PID.
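
The child-process approach could look roughly like this in bash. It assumes `pkill` and `pgrep` (procps) are installed; `kill_children` is a hypothetical helper name, and the lookup in the usage comment assumes the client process is named BESClient on the endpoint. It signals direct children only, and it cannot remove a process that is already defunct, because a zombie disappears only when its parent reaps it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: signal every direct child of a given parent PID.
# Usage: kill_children <parent_pid> [signal]   (signal defaults to TERM)
kill_children() {
    local parent=$1
    local sig=${2:-TERM}
    # -P restricts the match to processes whose parent PID is $parent.
    # pkill exits with 1 when no process matched.
    pkill "-$sig" -P "$parent"
}

# Usage (assumption: the client process is named BESClient):
#   kill_children "$(pgrep -o -x BESClient)"
```

Running `pgrep -l -P <pid>` first lists what would be signalled, which is a safer starting point on a production server.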
