Timeout disposition

MattPeterson · November 4, 2020, 10:10pm

I’m attempting to add a timeout to wait commands in actions to prevent actions from creating hung processes. I found that the override command, along with timeout_seconds and disposition, should do the trick. In testing I found that on Windows this works as expected: the cmd is killed after the timeout. On RHEL, however, it doesn’t seem to kill the child processes.

Here is an example of the Action script:

delete __createfile
createfile until end
#!/bin/sh
echo sleep start date >> /tmp/test.log 2>&1
sleep 600
echo sleep finished date >> /tmp/test.log 2>&1
end

delete run.sh
move __createfile run.sh
wait chmod 555 run.sh
override wait
timeout_seconds=60
disposition=terminate
wait ./run.sh

When running this action I see the action ends with status timeout reached, which is good, however the sleep process continues running on the endpoint. Shouldn’t disposition=terminate kill sleep as well since run.sh was the parent process?

ps output showing run.sh executing:

10085 2187 0 17:06 ? 00:00:00 /bin/sh ./run.sh
10087 10085 0 17:06 ? 00:00:00 sleep 600

ps output after timeout showing sleep still running:

10087 1 0 17:06 ? 00:00:00 sleep 600
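
The same orphaning can be reproduced outside the agent. The following is a minimal sketch in plain sh, assuming a Linux ps that accepts -e and -o. The kill -TERM here is only a stand-in for whatever the agent does at timeout, since the thread does not say which signal the agent sends. The variable names are illustrative.

```shell
#!/bin/sh
# Parent shell that forks a long-running child, like run.sh does with sleep.
sh -c 'sleep 30; echo finished' &
parent=$!
sleep 1   # give the parent time to start its child

# Look up the child by its parent pid.
child=$(ps -e -o pid= -o ppid= | awk -v p="$parent" '$2 == p { print $1 }')

# Terminate only the parent (stand-in for the agent's timeout handling).
kill -s TERM "$parent"
wait "$parent" 2>/dev/null || true

# The child was never signalled, so it should still be running.
if kill -0 "$child" 2>/dev/null; then
    child_survived=yes
    new_parent=$(ps -o ppid= -p "$child" | tr -d ' ')
    echo "child $child survived; its parent is now $new_parent"
else
    child_survived=no
fi

# Clean up the orphan so the demo leaves nothing behind.
kill -s TERM "$child" 2>/dev/null || true
```

Terminating a process does not by itself signal its children, so the surviving sleep is adopted by init (or a subreaper), which matches the PPID of 1 in the ps output above.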

RhondaSTK_HCL · November 4, 2020, 10:39pm

@cmcannady any chance you’d be able to help Matt out on this question?

MattPeterson · November 4, 2020, 10:43pm

Also see Actionscript Wait timeout on hanging process

I may have to do the pause while and kill method, but was hoping it could be simplified with the override options.

Is what I’m seeing a bug, by design, or am I doing something wrong?

cmcannady · November 5, 2020, 3:38pm

@MattPeterson, the only component of the override that I don’t see addressed in your Action Script is the completion configuration. From the online documentation, this is an integral component.

On UNIX/Linux platforms session IDs are used to manage job processes. Session IDs take on the value of the process id of the session leader (the process you want to launch). The client waits for the leader process to end, as in the Completion=process case, then begins a cycle of a half-second of sleep followed by enumerating processes looking for anything with a session id matching the job leader’s process id. When no more of these processes exist, the job is complete and the command finishes.

The exit code returned with the command is always that of the leader process, not the last process to complete.
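
The session-based wait described in that passage can be approximated from a shell. This is a rough sketch, not the agent's implementation: it assumes setsid from util-linux and a ps that supports the sid and stat keywords (procps does), and session_members is a hypothetical helper name.

```shell
#!/bin/sh
# List live (non-zombie) PIDs whose session id equals $1.
session_members() {
    ps -e -o pid= -o sid= -o stat= |
        awk -v s="$1" '$2 == s && $3 !~ /^Z/ { print $1 }'
}

# Start a session leader that spawns two children, and record its pid.
pidfile=$(mktemp)
setsid sh -c 'echo $$ > "$0"; sleep 30 & sleep 30' "$pidfile" &
sleep 1
leader=$(cat "$pidfile")
rm -f "$pidfile"

# End only the leader; its children keep the leader's pid as session id.
kill -s TERM "$leader"
sleep 1
leftover=$(session_members "$leader")
echo "still in session $leader:" $leftover

# Clean up the leftovers, then poll every half second until none remain
# (bounded to 20 tries so the demo cannot loop forever).
for pid in $leftover; do
    kill -s TERM "$pid" 2>/dev/null || true
done
tries=0
while [ -n "$(session_members "$leader")" ] && [ "$tries" -lt 20 ]; do
    sleep 0.5   # fractional sleep is a GNU/BSD extension
    tries=$((tries + 1))
done
```

The point of the sketch is that processes left behind by the leader are still discoverable by session id after the leader itself has exited.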

@AlanM, can you please weigh-in regarding this matter from the BES agent perspective?

MattPeterson · November 5, 2020, 6:55pm

I see the same behavior when using completion=job; it seems the default is process.

FDA · November 9, 2020, 10:35am

It works on my RHEL test machine. To make troubleshooting easier, please first create the test process on the target machine. I use this:

#!/bin/sh
echo sleep start date >> /TEM/ciclo.log 2>&1
date >> /TEM/ciclo.log 2>&1
sleep 60
echo sleep finished date >> /TEM/ciclo.log 2>&1
date >> /TEM/ciclo.log 2>&1

Then I run the following simple “override” action:

// Open a program on target system.
// Thread execution failed due to timeout, process is terminated.
override wait
timeout_seconds=10
disposition=terminate
wait "/TEM/ciclo.sh"

The agent log shows the following message:

ActionLogMessage: (action:681) Action signature verified for Execution
ActionLogMessage: (action:681) starting action
At 11:25:15 +0100 - actionsite (http://:52311/cgi-bin/bfgather.exe/actionsite)
Command succeeded override wait (action:681)
Command succeeded override timeout_seconds=10 (action:681)
Command succeeded override disposition=terminate (action:681)
Command started - wait "/TEM/ciclo.sh" (action:681)
At 11:25:15 +0100 -
Encrypted Report posted successfully
At 11:25:25 +0100 - actionsite (http://:52311/cgi-bin/bfgather.exe/actionsite)
Command failed (process killed after timeout) wait "/TEM/ciclo.sh" (action:681)
At 11:25:25 +0100 -
ActionLogMessage: (action:681) ending action
At 11:25:25 +0100 - mailboxsite (http://:52311/cgi-bin/bfgather.exe/mailboxsite1089733827)
Not Relevant - Terminate_timeout_Linux (fixlet:681)

The script log shows the following to prove it was terminated:

sleep start date
Mon Nov 9 11:25:15 CET 2020

MattPeterson · November 12, 2020, 7:54pm

That’s basically the same test my action script was doing, and I saw the same result that you did. The script is terminated; the issue is that the child process created by the script (sleep) was not terminated.

On Windows, running the same test using the timeout command in a cmd script, I see the timeout process is terminated after the timeout window.

I would like to see the same behavior on Linux that I do on Windows: the child processes spawned by the terminated script should also be terminated.

cmcannady · November 12, 2020, 7:58pm

@MattPeterson, would you be willing to open a support case with L2 regarding this matter? Once opened, please PM me the case number and I’ll add myself as a watcher and raise to our agent architect. Thank you.

DanieleColi · November 17, 2020, 2:51pm

It’s by design that the fate of spawned child processes depends on how the parent process handles its children. In your examples, the Linux and Windows systems behave differently, but if, for instance, you rewrite your Windows example this way

@echo off
echo "starting"
start cmd /c "timeout.exe /t 60>nul"
echo "finish"

you will end up with the timeout process not being killed.

MattPeterson · November 17, 2020, 4:37pm

My use case is using a wrapper script to call another script like what’s provided in the software distribution templates.

In Windows the spawned script is terminated using the timeout options, but in RHEL it remains running.
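
Until terminate covers spawned processes, one possible workaround on Linux is to have the wrapper enforce its own deadline, shorter than timeout_seconds, and signal the whole process group itself. The following is a sketch under assumptions, not a tested BigFix recipe: run_with_deadline and group_alive are hypothetical helper names, setsid comes from util-linux, ps is assumed to support the pgid and stat keywords, and the payload's exit status is not propagated.

```shell
#!/bin/sh
# True while any live (non-zombie) process has process group id $1.
group_alive() {
    ps -e -o pgid= -o stat= |
        awk -v g="$1" '$1 == g && $2 !~ /^Z/ { found = 1 } END { exit !found }'
}

# run_with_deadline SECONDS COMMAND [ARGS...]
# Returns 124 if the deadline was reached, 0 if the group finished first.
run_with_deadline() {
    limit=$1
    shift
    pidfile=$(mktemp)
    # setsid makes the payload a session and process-group leader, so
    # everything it spawns shares a process group id equal to its pid.
    setsid sh -c 'echo $$ > "$0"; exec "$@"' "$pidfile" "$@" &
    sleep 1
    leader=$(cat "$pidfile")
    rm -f "$pidfile"
    elapsed=1
    while group_alive "$leader"; do
        if [ "$elapsed" -ge "$limit" ]; then
            # A negative pid targets the whole process group.
            kill -s TERM -- "-$leader" 2>/dev/null
            return 124
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Example: a payload that would outlive a 2 second deadline.
status=0
run_with_deadline 2 sh -c 'sleep 30 & sleep 30' || status=$?
sleep 1
echo "exit status: $status"
```

In an action, the wrapper script would call run_with_deadline with the real payload in place of the sleep example. The override timeout could stay in place as a backstop.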

MattPeterson · November 19, 2020, 3:12pm

I opened an idea to have terminate kill spawned processes as well here:

https://bigfix-ideas.hcltechsw.com/ideas/BFP-I-142