Why when running a baseline do some endpoints finish and others fail?

BCannon · January 18, 2018, 11:07pm

My title should pretty much sum it up. I run a baseline, some machines run just fine and update, others fail. My main question is, how do I find out why they failed so I can adjust whatever it is that is failing?

On a side note, I’ve seen it where some of the fixlets have no action selected and then the whole baseline fails, but that’s not the case here. These are critical and important updates.

mwolff · January 19, 2018, 1:15pm

You’ll most likely have to break it down by action and look at exit codes.

When doing the monthly patch releases from Microsoft I always deploy to a pilot audience first, to see if there are any issues with the patch content. Most times there aren’t, but every now and again a relevance error sneaks in and causes the action to fail even though the patch was successfully applied (exit code 0, but relevance still evaluates to true, so the action is considered Failed). Case in point, this happened with one of the January 2018 updates. Again, though, this is rare.
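To make that failure mode concrete, here is a rough Python sketch of the triage logic described above. It is only a simplified model, not how the BigFix client actually computes status; the function name `diagnose` and its two inputs are hypothetical.

```python
def diagnose(exit_code, still_relevant):
    """Return a rough triage hint for one fixlet action (hypothetical helper).

    Assumption: the action counts as failed whenever the fixlet's relevance
    is still true after it runs, regardless of the installer's exit code.
    """
    if not still_relevant:
        return "fixed"
    if exit_code is None:
        return "failed: no exit code, check the action info"
    if exit_code == 0:
        # The installer reported success, so the relevance itself is suspect.
        return "failed: installer reported success, suspect the relevance"
    return f"failed: look up exit code {exit_code}"


for code, relevant in [(0, False), (0, True), (1603, True), (None, True)]:
    print(code, relevant, "->", diagnose(code, relevant))
```

The second case, exit code 0 with relevance still true, is the one that points at the patch content rather than at the endpoint.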

A common issue I see in our environment is exit code 112; this just means the system doesn’t have enough free space to perform the update. I have an IBM Web Report set up for server admins to check when their servers go below a certain threshold; even though we have HP OMI monitoring for disk space, some of the monthly rollup patches are now getting so big that having just a GB free on the system drive is no longer sufficient.
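For a quick local check of the same condition, something like the following Python sketch could be run on a machine before patching. The 5 GiB threshold is purely an assumption for illustration, and `has_enough_free_space` is a hypothetical helper; pick a value that fits the size of your rollups.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def has_enough_free_space(path, required_bytes):
    """Return True if the volume holding `path` has at least `required_bytes` free."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes


# Assumed threshold for illustration only; "." is used so this runs anywhere.
# On a Windows endpoint you would typically pass the system drive, e.g. "C:\\".
required = 5 * GIB
if has_enough_free_space(".", required):
    print("Enough free space for patching")
else:
    print("Low disk space: expect exit code 112 style failures")
```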

Other issues are the extremely helpful 1603, which means anything could have gone wrong, or 1618 which means another install is already running.
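If you export the per-computer exit codes for an action, a small lookup makes the pattern easier to spot. This is a minimal Python sketch: the computer names and results are made-up sample data, and the table only covers the codes mentioned in this thread.

```python
from collections import Counter

# Windows / Windows Installer codes mentioned in this thread.
EXIT_CODES = {
    0: "success",
    112: "not enough disk space",
    1603: "fatal error during installation (generic)",
    1618: "another installation is already in progress",
}


def describe(code):
    """Translate an exit code into a short description."""
    if code is None:
        return "no exit code reported"
    return EXIT_CODES.get(code, f"unknown exit code {code}")


def summarize(results):
    """Count how many computers hit each outcome."""
    return Counter(describe(code) for code in results.values())


# Hypothetical sample data: computer name -> exit code.
sample = {"PC-01": 0, "PC-02": 112, "PC-03": 112, "PC-04": 1618, "PC-05": None}

for outcome, count in summarize(sample).most_common():
    print(f"{count:3d}  {outcome}")
```

Grouping this way shows quickly whether the failures cluster around one cause, such as disk space, or are scattered.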

Based on experience patching Microsoft systems (and I note that you didn’t specify whether this was Unix, Linux, MS, or something else), we’ve taken certain preventative actions: rebooting systems before patching begins so we can clear out existing, pending installs, waiting until the baseline finishes before restarting again, and similar maintenance steps.

In short, there’s no easy way to figure this out, and you’ll most likely have to do some digging.

BCannon · January 19, 2018, 4:51pm

Thanks Martin. I figured as much. When I do have an exit code, I’ve tried to look it up. The odd thing is when there is no exit code, which has been the case lately. I’ll keep digging. Seems like I might be seeing a pattern with the group of machines that are failing. We’ll see what I find.

mwolff · January 19, 2018, 5:04pm

If there’s no exit code, what does the action info tab tell you? You should usually be able to see something like this:

BCannon · January 23, 2018, 3:45pm

Well, now I feel dumb. I just discovered that one of the fixlets in the baseline that started this question didn’t have an action selected, which of course will make things fail. So, I’ll go back in my corner and think about what I’ve done.