Running a command with a timeout

Yes, that line has worked in the past. Per the log, the issue appears to be a relevance substitution error:

Command failed (Relevance substitution error.) if {exists (ids of processes whose (name of it = "java.exe" and ppid of it = pid of process "cmd.exe")}

ActionLogMessage: (action:xxxxxxx) ending action

Yet cmd.exe and java.exe are still running…

That’s a relevance substitution error; you appear to be missing a closing parenthesis. The statement has two opening parentheses and only one closing parenthesis.

It’s still running because the action never reached the line that would kill the process. Once you fix that, you should be able to start troubleshooting the taskkill.
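The parenthesis count can be checked mechanically before deploying an action. The following is a minimal Python sketch, not a BigFix tool: the function name `paren_balance` is hypothetical, and it only counts parentheses outside double-quoted strings, so it does not validate relevance syntax in any other way.

```python
def paren_balance(expr: str) -> int:
    """Return (opening - closing) parentheses, ignoring double-quoted strings."""
    depth = 0
    in_string = False
    for ch in expr:
        if ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
    return depth


# The statement from the failing log line, without the surrounding braces.
failing = ('exists (ids of processes whose (name of it = "java.exe" '
           'and ppid of it = pid of process "cmd.exe")')
fixed = failing + ")"

print(paren_balance(failing))  # positive result: a closing parenthesis is missing
print(paren_balance(fixed))    # zero: balanced
```

A non-zero result means the expression cannot be valid, though a zero result does not prove that it is.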

Interesting, so it attempted to kill the (correct) PID of cmd.exe, but did so with an error:

Command started - waithidden taskkill /T /pid 3964
Command succeeded (Exit Code=128) waithidden taskkill /T /pid 3964

Think it’s referring to this?

ERROR_WAIT_NO_CHILDREN
128 (0x80)
There are no child processes to wait for

FYI - PIDs/PPIDs (at the time) for reference:

q: pid of service "BESClient"
A: 5020
T: 0.023 ms

q: pid of process "cmd.exe"
A: 3964
T: 0.059 ms

q: ppid of process "cmd.exe"
A: 5020
T: 0.024 ms

q: pid of process "java.exe"
A: 1884
T: 0.020 ms

q: ppid of process "java.exe"
A: 3964
T: 0.024 ms

q: pid of process "scap12_registry1.exe"
A: 2972
T: 0.031 ms

q: ppid of process "scap12_registry1.exe"
A: 1884
T: 0.020 ms

So:
PID of BESClient = PPID of cmd.exe
PID of cmd.exe = PPID of java.exe
PID of java.exe = PPID of scap12_registry1.exe

And hence all the relevant job processes are still running. I’m a little stumped because everything looks right… and there are child processes.
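To make the parent/child chain explicit, here is a small Python sketch that rebuilds the tree from the PID and PPID values reported above and lists everything underneath a given process. It is only an illustration of what a tree kill starting at cmd.exe would be expected to reach; the helper name `descendants` is hypothetical, and the parent of BESClient is left as `None` because it is not shown in the thread.

```python
from collections import defaultdict

# (name, pid, ppid) as reported by the queries above.
PROCESSES = [
    ("BESClient", 5020, None),  # parent not shown in the thread
    ("cmd.exe", 3964, 5020),
    ("java.exe", 1884, 3964),
    ("scap12_registry1.exe", 2972, 1884),
]


def descendants(root_pid, processes):
    """Return (name, pid) for every process below root_pid in the tree."""
    children = defaultdict(list)
    for name, pid, ppid in processes:
        children[ppid].append((name, pid))

    found = []
    stack = [root_pid]
    while stack:
        parent = stack.pop()
        for name, pid in children[parent]:
            found.append((name, pid))
            stack.append(pid)
    return found


for name, pid in descendants(3964, PROCESSES):
    print(name, pid)
```

With the values above, java.exe and scap12_registry1.exe both sit below cmd.exe (PID 3964), which matches the chain written out in the post.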

So going to try switching from this:

waithidden taskkill /T /pid {ids of processes whose (name of it = "cmd.exe" and ppid of it = pid of service "BESClient")}

To this:

waithidden taskkill /T /pid {ids of processes whose (name of it = "java.exe" and ppid of it = pid of process "cmd.exe")}

Fingers crossed…

Ha, all that was wrong was a missing /F (force):

waithidden taskkill /T /pid {ids of processes whose (name of it = "cmd.exe" and ppid of it = pid of service "BESClient")} /F

All working now.

@strawgate - thanks a bunch for your input. Very happy with this solution as opposed to what I was doing before :slight_smile:

Just to close the loop on this, I now have this with a few fixlets in production, and there is at least one additional edge case to consider.
As noted earlier, this works in client 9.1 or later where pid of service "name" was added.

delete __createfile
createfile until EOF
REM Put real installer commands here
REM 'cmd /K' is a good way to test the timeout functionality
EOF
delete Installer.cmd
move __createfile Installer.cmd
parameter "StartTime"="{now}"
runhidden cmd.exe /c "Installer.cmd"

pause while {exists processes whose (name of it = "cmd.exe" and ppid of it = pid of service "BESClient") AND (now - (parameter "StartTime" as time) < 5 * minute)}

if {exists processes whose (name of it = "cmd.exe" and ppid of it = pid of service "BESClient")}
	parameter "KillingItDueToTimeout"="True"
	waithidden taskkill.exe {concatenation of ("/PID " & it as string & " ") of ids of processes whose (name of it = "cmd.exe" and ppid of it = pid of service "BESClient")} /T /F
else
	parameter "KillingItDueToTimeout"="False"
endif

This has the advantage that if there is more than one “cmd.exe” process spawned off by the BESClient, it will kill each of them; otherwise it would have given a relevance substitution error due to multiple results for the PID check. I also use the “KillingItDueToTimeout” parameter just to put an entry in the BES Client log.
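For readers who want the control flow spelled out, the pattern above (poll until the work finishes or a deadline passes, then kill every matching process in one taskkill call) can be sketched in Python. This is an illustration under stated assumptions, not a replacement for the actionscript: `build_taskkill_args` and `wait_for_exit` are hypothetical names, `is_running` stands in for the relevance check on cmd.exe, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time


def build_taskkill_args(pids):
    """Build one taskkill command line that covers every PID, mirroring the
    '/PID n /PID m ... /T /F' concatenation in the actionscript."""
    if not pids:
        raise ValueError("no PIDs to kill")
    args = ["taskkill.exe"]
    for pid in pids:
        args += ["/PID", str(pid)]
    return args + ["/T", "/F"]


def wait_for_exit(is_running, timeout_s, poll_s=1.0,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll is_running() until it returns False or timeout_s elapses.

    Returns True if the work finished on its own, False on timeout.
    """
    start = clock()
    while is_running():
        if clock() - start >= timeout_s:
            return False
        sleep(poll_s)
    return True


# Example: 3964 is the cmd.exe PID from earlier in the thread; 4100 is made up.
print(build_taskkill_args([3964, 4100]))
```

Passing all PIDs to a single taskkill invocation is what avoids the "multiple results" substitution error described above: the command is built from a list, so it does not matter whether one or several cmd.exe processes match.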

…and to reopen old wounds…

I’ve encountered cases in several different fixlets where a command issued with “wait” or “waithidden” never completes, leaving the BES client unable to process any further actions. Rather than fixing every possible edge case and updating each fixlet/task individually (some of which are default content), I submitted an RFE to add a client setting that times out these external processes after a period of time. While I’m waiting on that, I went ahead and built something external that I’m testing out.

Have a look at
https://bigfix.me/fixlet/details/23049
https://bigfix.me/fixlet/details/23048
https://bigfix.me/fixlet/details/23047
https://bigfix.me/analysis/details/2998503

This is a process I’m calling “BESChildKiller”. Currently Windows-only, it sets up a Scheduled Task that periodically checks for child processes of BESClient.exe. If a child process exceeds a timeout value, it gets killed. (I’m currently testing with a 2-hour timeout; it’s configured by client settings, so you could assign different timeouts on different hosts.) There’s also an Analysis to read the logs if any processes have actually been killed. Specific external process names can be excluded; for instance, “rbagent.exe” from BigFix OSD can be expected to run for long durations, so we can whitelist this process to avoid killing it.

This is very much Alpha, and very much Use At Your Own Risk. Killing processes can be dangerous. So can running Scheduled Tasks in SYSTEM context.

…and THIS is why it’s alpha. I wasn’t handling a condition where the BESClient’s PID formerly belonged to another process, and this task now tries to kill children of processes that don’t belong to BESClient. I should be able to post a fix to bigfix.me tomorrow.

There is a rare edge case to consider when using {now} for pause while.

What if the client’s clock is set back to a time in the past?

One possible solution is to use the absolute value of (now - parameter), so that if the jump into the past is bigger than the wait time, the timeout triggers and the action moves on.

Another possibility, which can be combined with the above, is to always use {apparent registration server time} instead of {now}. It should be approximately equal to {now}, but it is also guaranteed never to move into the past unless relay selection occurs AND the relay’s clock is set back. Even in that case, if you use the absolute value, the wait is at most 3x the expected wait time.

This edge case should be so rare that fixing it everywhere is hardly worth it, but this is a pretty good solution to it.
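The effect of the absolute value can be seen with a few lines of Python. This is only a sketch of the arithmetic: the function names are hypothetical, the dates are arbitrary, and the 5-minute timeout is borrowed from the earlier actionscript.

```python
from datetime import datetime, timedelta


def naive_timed_out(start, now, timeout):
    """Plain (now - start) check: never fires while the clock is behind start."""
    return now - start >= timeout


def abs_timed_out(start, now, timeout):
    """Absolute-value check: also fires when the clock jumped far backwards."""
    return abs(now - start) >= timeout


start = datetime(2024, 1, 1, 12, 0)   # arbitrary example start time
timeout = timedelta(minutes=5)

# Two minutes into the wait, the clock is set back by one hour.
after_big_jump = start + timedelta(minutes=2) - timedelta(hours=1)
# Two minutes into the wait, the clock is set back by only four minutes.
after_small_jump = start + timedelta(minutes=2) - timedelta(minutes=4)

print(naive_timed_out(start, after_big_jump, timeout))   # False: would keep waiting
print(abs_timed_out(start, after_big_jump, timeout))     # True: moves on
print(abs_timed_out(start, after_small_jump, timeout))   # False: still within the window
```

A backwards jump smaller than the timeout is not detected by either check; the wait simply runs longer than intended until the clock catches up.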

I think I found a possible solution to running a command on Windows with a timeout.

Command: cscript //?

Output:

Microsoft (R) Windows Script Host Version 5.812
Copyright (C) Microsoft Corporation. All rights reserved.

Usage: CScript scriptname.extension [option...] [arguments...]

Options:
 //B         Batch mode: Suppresses script errors and prompts from displaying
 //D         Enable Active Debugging
 //E:engine  Use engine for executing script
 //H:CScript Changes the default script host to CScript.exe
 //H:WScript Changes the default script host to WScript.exe (default)
 //I         Interactive mode (default, opposite of //B)
 //Job:xxxx  Execute a WSF job
 //Logo      Display logo (default)
 //Nologo    Prevent logo display: No banner will be shown at execution time
 //S         Save current command line options for this user
 //T:nn      Time out in seconds:  Maximum time a script is permitted to run
 //X         Execute script in debugger
 //U         Use Unicode for redirected I/O from the console

The important bit: //T:nn Time out in seconds: Maximum time a script is permitted to run

It should be possible to invoke nearly anything using CScript with a built-in timeout.

This would not address content written by others.

We actually ran into an issue recently with cscript’s built-in timeout. We had a VBScript that called an external exe that sometimes got stuck, so we added the timeout argument thinking that it would help, but it still hung. We ended up adding some timeout code to the script itself, and that worked. Link is here if anyone needs it, but it’s only applicable if you are actually writing VBScript. So your mileage may vary if you are attempting to use cscript as a general timeout solution.
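The general idea of putting the timeout in the calling script, rather than relying on the script host, looks like this in Python. It is a sketch of the concept only and not the VBScript referred to above; `run_with_timeout` is a hypothetical helper. It relies on `subprocess.run`, which kills the direct child and raises `subprocess.TimeoutExpired` when the timeout expires. Any grandchildren the child started are not covered by this sketch.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_s):
    """Run command and return its exit code, or None if it exceeded timeout_s.

    subprocess.run kills the child process itself on timeout; processes
    that the child spawned are not handled here.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

For example, `run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], 2)` gives up after about two seconds and returns `None` instead of blocking for the full minute.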

Looks like a nice, short, elegant solution. Is there a reason you chose to exec ‘waitfor’ rather than using wscript.sleep() ?

That’s not my code, just a snippet I found online when looking into the issue with the native cscript timeout. It’s funny that you mention that though, I had the same question. And I just looked at my code and it looks like I did change waitfor to timeout. I have used timeout numerous times but have never really used waitfor. I’m sure wscript.sleep() could work the same way.

Seems like BESClientUI.exe should ALWAYS be excluded from being killed automatically.

names of items 1 of (it, processes) whose(item 0 of it = ppid of item 1 of it) of pids of (processes "BESClient.exe" ; processes "BESClient")

This relevance returns BESClientUI.exe for me. The new SSA app might be another case.

Nice catch; I’m handling BESClientUI.exe already but I don’t have the SSA app. If it runs a separate process on the client it should also be excluded.

This is related to the content at https://bigfix.me/fixlet/details/23050 which sets up the Scheduled Task and actually runs the process check query. From 23050, it doesn’t display well via web but the QNA query file that is generated contains

// The following file will be a query passed to qna.exe.  It should output the list of process IDs to terminate -
appendfile Q: /* Do not remove this comment Version:1.3 */ pids of it of processes whose ((ppid of it = pid of service "BESClient") and (name of it != "BESClientUI.exe") AND (creation time of it > creation time of process (pid of service "BESClient")) and (now - creation time of it > value of setting "BESClient_ChildProcess_Timeout" of client as time interval) and (if not exists setting "BESClient_ChildProcess_ExcludeList" of client then true else (pid of it is not contained by  set of (pid of it; pids of processes (ppids of it); pids of processes (ppids of processes (ppids of it))) of processes whose (name of it as lowercase is contained by (set of (substrings separated by ";" of  value of setting "BESClient_ChildProcess_ExcludeList" of client as trimmed string as lowercase))))))

The intent is to exclude BESClientUI.exe, and additionally if there is a BESClient_ChildProcess_ExcludeList, then exclude any process or child process of an item in the ExcludeList.
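The filtering logic of that query is easier to follow when written out as separate conditions. The Python sketch below mirrors the main clauses under some loud simplifications: the `Proc` record and `stale_children` function are hypothetical, name matching is done case-insensitively throughout, and the exclude list is applied only to the candidate's own name, whereas the real query also follows parent links.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Proc:
    name: str
    pid: int
    ppid: int
    created: datetime


def stale_children(procs, client, now, timeout, exclude_names=()):
    """Return children of the client process that have run longer than timeout."""
    excluded = {"besclientui.exe"}
    excluded |= {n.strip().lower() for n in exclude_names if n.strip()}
    return [
        p for p in procs
        if p.ppid == client.pid                # direct child of the client
        and p.created > client.created         # guards against a reused PID
        and now - p.created > timeout          # has exceeded the timeout
        and p.name.lower() not in excluded     # simplified exclude-list check
    ]


# Made-up example data: the client started at t0, and "now" is four hours later.
t0 = datetime(2024, 1, 1, 8, 0)
client = Proc("BESClient.exe", 5020, 700, t0)
procs = [
    Proc("csrss.exe", 600, 5020, t0 - timedelta(minutes=5)),   # older than the client
    Proc("cmd.exe", 3964, 5020, t0 + timedelta(hours=1)),      # hung for 3 hours
    Proc("BESClientUI.exe", 4200, 5020, t0 + timedelta(minutes=1)),
    Proc("rbagent.exe", 4300, 5020, t0 + timedelta(minutes=30)),
]
now = t0 + timedelta(hours=4)
exclude = "rbagent.exe".split(";")
print([p.name for p in stale_children(procs, client, now, timedelta(hours=2), exclude)])
```

The creation-time comparison is the part that protects a process whose recorded parent PID merely happens to equal the client's current PID.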

I’m trying to come up with a generic option to close any subprocess of BESClient that isn’t the UI or SSA. I’m not currently concerned with deeper levels of processes spawned by CMD though maybe I should be for the Exclusion option to work correctly.

I’m currently doing this to be included within an action myself, so really a modified approach to: Running a command with a timeout - #26 by JasonWalker

Since I am intending this to use in an action, I probably don’t need BESClient_ChildProcess_ExcludeList except to be able to add things like SSA without modifying the relevance.

names of items 1 of (it, items 1 of (it, processes) whose(item 0 of it = ppid of item 1 of it) of pids of (processes "BESClient.exe" ; processes "BESClient") ) whose(name of item 1 of it is not contained by item 0 of it) of sets of ( "BESClientUI.exe" ; substrings separated by ";" of values of settings "BESClient_ChildProcess_ExcludeList" of clients )

The goal is a piece of actionscript that could be added to the end of any action with the only modifications to the action being changing wait to run… assuming that only 1 thing is being “run”.


What does this achieve? Is this because you need to exclude processes with a ppid that just happens to match that of the BES Client, but might have been run by something else before it?


When wanting to timeout child processes in an action context, then this seems like it would be useful:

processes whose(creation time of it > active start time of active action)

This should work more narrowly inside of an action, but more broadly outside of an action:

names of items 1 of (it, processes ) whose( creation time of item 1 of it > (active start time of active action | creation time of item 0 of it ) AND pid of item 0 of it = ppid of item 1 of it) of (processes "BESClient.exe" ; processes "BESClient")

Since this lacks any exclusions, it should return BESClientUI.exe on windows. It seems like the creation time check is faster than the ppid check.

Q: number of creation times of processes
A: 147
T: 3.025 ms

Q: number of ppids of processes
A: 148
T: 327.941 ms

It is a good idea to use "number of" to compare the relative speed of each inspector, which dictates the order in which they should appear in a complex relevance clause for maximum performance. In this case, creation time is roughly 100 times faster (about 3 ms versus 328 ms).
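Why the order matters can be shown with a short Python analogy. It assumes the evaluator stops at the first false clause, as Python's `and` does; the thread does not confirm how the relevance engine evaluates AND internally, so treat this as an illustration of the principle only. All names here are hypothetical.

```python
def count_expensive_calls(items, cheap, expensive, cheap_first):
    """Filter items with both predicates and count how often `expensive` runs."""
    calls = 0

    def counted(x):
        nonlocal calls
        calls += 1
        return expensive(x)

    if cheap_first:
        matches = [x for x in items if cheap(x) and counted(x)]
    else:
        matches = [x for x in items if counted(x) and cheap(x)]
    return matches, calls


items = range(100)
cheap = lambda x: x % 10 == 0     # stands in for the fast check (creation time)
expensive = lambda x: x > 50      # stands in for the slow check (ppid)

print(count_expensive_calls(items, cheap, expensive, cheap_first=True))
print(count_expensive_calls(items, cheap, expensive, cheap_first=False))
```

Both orderings return the same matches, but putting the cheap test first means the slow one only runs for the items that survive it.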

Related: https://bigfix.me/relevance/details/2999306


This is even better, and more cross platform: (which is why I’m using processes instead of services)

names of items 2 of (pid of it, (active start time of active action | ( if (exists properties whose(it as string starts with "creation time of <process>")) then creation time of it else start time of it ) of it), processes ) whose( creation time of item 2 of it > item 1 of it AND item 0 of it = ppid of item 2 of it) of (processes "BESClient.exe" ; processes "BESClient")

I have no idea why there is a difference between creation time of process and start time of process other than that they are supported on different platforms. Seems like they should be aliases of each other and available on all platforms where either is supported… which is probably a question for: @AlanM


Yet another improvement:

names of items 3 of (pid of it, (active start time of active action | ( if (exists properties whose(it as string starts with "creation time of <process>")) then creation time of it else start time of it ) of it), set of ("BESClientUI.exe";"BESClientUI"), processes ) whose( creation time of item 3 of it > item 1 of it AND name of item 3 of it is not contained by item 2 of it AND item 0 of it = ppid of item 3 of it) of (processes "BESClient.exe" ; processes "BESClient")

Note: active start time of active action should always work on all platforms in an action context, while neither creation time of process nor start time of process will work on Mac currently.

Thanks James, there is a lot of insight in there!

Exactly. I had some cases when testing. During Windows startup, a process (SMSS I believe) had spawned another process (CSRSS.exe). The SMSS process then terminated, and when the BES Client service started it reused the PID number from SMSS. Later the childkiller process was incorrectly trying to terminate CSRSS because the ppid matched BESClient’s PID.

[quote=“jgstew, post:36, topic:15467”]
This should work more narrowly inside of an action, but more broadly outside of an action:
[/quote]

I like your evolved queries and should update my BESChildKiller to use them. I’m mostly running outside of an action context, because I’m trying to overcome problems with the default content that sometimes leave the system hung. The Chrome and Java upgrade fixlets in particular seem to hang with regularity, so I need the poller running outside of the client context.

For me this is a stopgap, and I’m hoping I don’t need to keep doing it for much longer. My RFE to add this function to the client was accepted and the status is “Scheduled for future release”…

I haven’t tested this, but here is my actionscript I’m working with:

// log SMBv1 connections to a file
runhidden CMD /C (for %i in ({ concatenations " " of items 0 of (it, (set of unique values whose(it != "") of (it as trimmed string) of lines of files "results_SMBv1_ServerNames.log" of folders "Logs" of folders "__Global" of data folders of clients) | (set of "") ) whose(item 0 of it is not contained by item 1 of it) of (unique values of (it as trimmed string as lowercase) of string values of properties "ServerName" of select objects "Dialect,ServerName from MSFT_SmbConnection" whose ((string value of property "dialect" of it) as version < "2.0") of wmis "Root\Microsoft\Windows\SMB") }) do @echo %~i >> "{ pathnames of folders "Logs" of folders "__Global" of data folders of clients }\results_SMBv1_ServerNames.log")


// The following is to provide timeout functionality in case CMD hangs
pause while { now < (active start time of active action + 5 * minute) AND exists items 3 of (pid of it, (active start time of active action | ( if (exists properties whose(it as string starts with "creation time of <process>")) then creation time of it else start time of it ) of it), set of ("BESClientUI.exe";"BESClientUI"), processes ) whose( creation time of item 3 of it > item 1 of it AND name of item 3 of it is not contained by item 2 of it AND item 0 of it = ppid of item 3 of it) of (processes "BESClient.exe" ; processes "BESClient") }

if {exists items 3 of (pid of it, (active start time of active action | ( if (exists properties whose(it as string starts with "creation time of <process>")) then creation time of it else start time of it ) of it), set of ("BESClientUI.exe";"BESClientUI"), processes ) whose( creation time of item 3 of it > item 1 of it AND name of item 3 of it is not contained by item 2 of it AND item 0 of it = ppid of item 3 of it) of (processes "BESClient.exe" ; processes "BESClient") }
	parameter "KillingItDueToTimeout"="True"
	waithidden Taskkill /T /F { ("/PID " & it) of concatenations " /PID " of (it as string) of pids of items 3 of (pid of it, (active start time of active action | ( if (exists properties whose(it as string starts with "creation time of <process>")) then creation time of it else start time of it ) of it), set of ("BESClientUI.exe";"BESClientUI"), processes ) whose( creation time of item 3 of it > item 1 of it AND name of item 3 of it is not contained by item 2 of it AND item 0 of it = ppid of item 3 of it) of (processes "BESClient.exe" ; processes "BESClient") }
else
	parameter "KillingItDueToTimeout"="False"
endif

NOTE: using active start time of active action instead of (parameter "StartTime" as time) only makes sense if the action only does 1 thing. Otherwise you’re setting the timeout based upon the total time for all executions, not just the last one, which is a lot harder to control well.

About the BESChildProcessKiller: I’ve downloaded the fixlet and tested deploying it on two machines, but I’m getting Not Relevant. Why?

Now this functionality is available natively in the client. For more details, see
