We’re experiencing a very strange issue during our patch baseline deployment on (SOME!!) of our mobile devices.
In between some (random order) patch components in the baseline, the client has trouble emptying the download folder used by BigFix (see attached log).
This happens only with one specific type of device, and not on all of them; most succeed. But the failure rate is high enough to make us suspect that something is intrinsically wrong, or that another process is using the custom site download folder.
These were our troubleshooting steps:
Put a BESClient reset fixlet in the beginning of the baseline
Try to reproduce the issue and use procmon to determine what is locking/using the folder (the only process using this folder during the issue is the SYSTEM process)
Use a multiple action group instead of a baseline
After all this, we’re still unable to find the cause of this annoying issue. We also tried setting the root cause aside and just finding a workaround, but nothing has worked so far.
Guys, could you shed your light on this issue and help my colleague out? We’ve tried with a PMR, but there is still nothing concrete… We’ve received 2 weeks from our business to get this issue cleared…
What files are in the Downloads directory? Are there any?
Are any of the components using RUN instead of WAIT in the actionscript?
As @Aram points out, you often need to run diagnostic tools as SYSTEM, especially in Windows 10. I’m finding things that used to work on Windows 7 as ADMIN that only work in Windows 10 as SYSTEM. Same goes for the Fixlet Debugger when writing relevance. Most of the time I only run it as Admin, but in a few special cases, like particular Registry keys or WMI queries, it only works as SYSTEM.
As @jgstew states, the first thing to check is whether there are any files in the _Download folder. There’s a separate _Download folder for each Site on the client. The Site in use should be whatever site contains the baseline that was the source for action 307107 in this case. You’ll find the _Download folder at \Program Files (x86)\BigFix Enterprise\BigFix Client\__BESData\[sitename]\_Download.
It’s very likely that the folder will still have the files in it, indicating what installer process is “stuck”. I’ve seen that occur fairly frequently in my own custom content, but also in the default BigFix content for installing Chrome, Java, and several other packages. What usually happens is that the installer has encountered some error condition and is trying to prompt for an input or acknowledgement, but since it’s running under Services it does not have access to the interactive desktop and there’s nowhere for its message to go.
You’ll likely need to kill the installer process for whatever you find still in the download folder. If there are no files there at all, the installer may have moved itself to another location but still used the download folder as its “current working directory”. Some installers extract to the TEMP folder and execute from there, cleaning up the source archive that was in the download folder, but if the process is still active and has the download folder as its CWD, the folder can’t be cleared.
You mentioned procmon and process explorer. I’d also try “handle” (also from Sysinternals) to list open file handles and their associated processes. Running handle _Download should show you which process has the folder locked.
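If you capture that output on many devices, a small script can summarise who holds the folder. This is only a sketch: the line shape in the regular expression is my assumption about what handle prints for a name search, and the sample text, paths and PIDs are made up for illustration, not taken from your devices.

```python
import re

# Assumed shape of one line of `handle <name>` output (Sysinternals).
# The sample text further down is hypothetical, not real output from this thread.
HANDLE_LINE = re.compile(
    r"^(?P<image>\S.*?)\s+pid:\s*(?P<pid>\d+)\s+type:\s*(?P<type>\S+)\s+"
    r"(?P<handle>[0-9A-Fa-f]+):\s*(?P<path>.+)$"
)

def holders(handle_output, folder_fragment):
    """Return sorted (pid, image) pairs holding a path that contains folder_fragment."""
    found = set()
    for line in handle_output.splitlines():
        match = HANDLE_LINE.match(line.strip())
        if match and folder_fragment.lower() in match.group("path").lower():
            found.add((int(match.group("pid")), match.group("image")))
    return sorted(found)

# Hypothetical capture: two processes hold something under __Download, one does not.
sample = r"""
System             pid: 4      type: File           1A4: C:\BES\__BESData\__Global\__Download\CustomSite
BESClient.exe      pid: 2312   type: File           2C0: C:\BES\__BESData\__Global\__Download\CustomSite\309999
notepad.exe        pid: 5120   type: File            44: C:\Users\test\Documents
"""

print(holders(sample, "__Download"))
```

Collecting the (pid, image) pairs per device makes it easier to see whether it is always the same process that keeps the folder open.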
The best way to fix this is to correct the underlying problem with the installer (whatever that problem may be). If that’s not possible, and the Fixlet(s) in question are custom content, you could swap out the “wait” or “waithidden” for “run” or “runhidden” and enforce a timeout, described at Running a command with a timeout
If you’re finding that this happens with a lot of different processes, with IBM default content, or for whatever reason you don’t want to handle it on a case-by-case basis, you can check out the BESChildKiller Tasks and Analyses that I posted to BigFix.me. They’re linked later in that thread at Running a command with a timeout. It sets up a Scheduled Task to run periodically (every 15 minutes, if I recall) to look for BES child processes that have been running longer than a specified timeout, and kills those processes. I previously had an RFE out for this functionality and it looks like it’s been accepted and scheduled for a future release, but in the meantime I’ve had this task running on a few hundred hosts for a couple of months and it’s working pretty well for me. Of course your mileage may vary…
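To make the “run, then enforce a timeout” idea concrete outside of ActionScript, here is a minimal Python sketch. It is not the content from the linked thread and not how BESChildKiller is implemented; the function name and the stand-in commands are hypothetical, and it only kills the process it started, not any children that process may have spawned.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_seconds):
    """Start cmd, wait up to timeout_seconds, and kill it if it is still running.

    Returns the exit code, or None when the process had to be killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # stop the stuck process
        proc.wait()  # reap it so it no longer holds its working directory
        return None

# Stand-ins for installers: one exits promptly, one hangs.
quick = [sys.executable, "-c", "raise SystemExit(0)"]
stuck = [sys.executable, "-c", "import time; time.sleep(60)"]

print(run_with_timeout(quick, 10))
print(run_with_timeout(stuck, 1))
```

The trade-off is the same as in the Fixlet version: a killed installer may leave the patch half applied, so the timeout should be comfortably longer than a normal install.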
Have you ever heard of or used the client setting _BESRelay_Downloads_OlderThanInDays?
What impact can it have on network level if you set the default value to 0 on your relays?
What can we expect will happen to our open actions that have been run already and might rerun if the action becomes relevant again?
That is a relay specific setting and should have no impact or benefit to clients running actions and clearing their download folders at all. Your issue is client specific and not relay specific. It is very unlikely that any setting will help at all as this should be an executable / installer / file locking / actionscript issue.
It clearly states that it’s when the server/relay is hanging, which is not the case with us.
Besides that, it was fixed in 9.5.3, which is the version we have.
So we didn’t implement this. We won’t change it on all our relays for a solution that isn’t even certain to work.
If somebody has any genius input left, please let us know.
My genius input is please answer the questions I asked above so that I can get a better understanding of what is going on so that I can help figure out what the solution might be.
Without knowing what is running before and during this problem, and without knowing what is in the downloads folder, I can’t really help you any further. The problem you are describing is specific to whatever you are running. This is not normal behavior, but it is expected if something is still running that shouldn’t be or similar. There are many possible solutions, but which one is correct or best depends entirely on what is happening.
@jgstew Let’s say we have a baseline of 10 patches, some not relevant for some machines.
The 1st and 2nd patch fixlet completed successfully in the BESlog (could also be 3rd & 4th). By completing successfully, I mean they return exit code 0 in the log:
Command started - waithidden __Download\office2003-KB3128043-FullFile-ENU.exe /q:a /r:n (group:309996,action:309999)
Command succeeded (Exit Code=0) waithidden __Download\office2003-KB3128043-FullFile-ENU.exe /q:a /r:n (group:309996,action:309999)
Command succeeded action may require restart "e0f2dff30127233ea60e67d3bdc04cc39e839958" (group:309996,action:309999)
ActionLogMessage: (group:309996,action309999) ending sub action
Not Relevant - MS16-148: Security Update for Microsoft Office - Word Viewer - KB3128043 (fixlet:309999)
Immediately after, the error starts for the next fixlet:
Even though there is no mention of it in the log, the third fixlet does start its download, because:
In the __BESData\__Global\__Download\CustomSite\ directory we have a folder, named with the id of the action of the THIRD patch.
The component right before the third patch was the second patch. The problem is not linked to a specific KB, since we have seen it in different baselines and with different KBs within those baselines.
During the error we can see in Procmon that there is still a handle on the download folder from the SECOND patch.
All used fixlets are standard from the “Patches for Windows” site and use the waithidden command.
Thank you for your feedback, we are happy to provide more info if needed.
It is not clean, but what if you add a custom task to wait some time between the 2nd and 3rd fixlets?
It seems some installers return (and waithidden exits) while some of their child processes are still running. It might improve the situation if you give the installer some time to really complete.
We tried running the fixlets with a pause (while process active and duration < 120 seconds) after the waithidden command as a troubleshooting measure, but the process ended, the next fixlet started and we had the issue again.
We also cannot see any other child processes using the download folder.
Including a set pause time in each fixlet is not a solution for these devices, since they are only online for a very short period of time in which they have to complete the patching baseline.
So the process “office2003-KB3128043-FullFile-ENU.exe” is still running and has a handle on the download folder after the action is marked complete? That’s what we’ve been trying to determine, and this is the information that needs to be included in the PMR.
In this case you’d have several options -
Get IBM to update the content for the office2003-KB3128043-FullFile-ENU.exe installer. Maybe there are additional command-line parameters that could be used, or perhaps wrapping it in a batch script with a start /wait or something along those lines.
Create a custom copy of the fixlet and try out this kind of correction yourself.
Add an extra task after Patch 2, to delay several minutes (in case the patch really is still running) and then issue a taskkill command to kill the office2003-KB3128043-FullFile-ENU.exe process. I’m not sure this will execute, though, because this task itself may get stuck when the download folder cannot be cleared.
Check out the BESChildKiller tasks I linked earlier in the thread, that can use the Task Scheduler to kill any “stuck” patch processes.
It’s a bit scary though that you’re still deploying Office 2003 patches…
Is it office2003-KB3128043-FullFile-ENU.exe or something else?
Can you expand upon this? I am a bit confused as to what this means.
You might be having this happen with multiple different patches but you really need to keep track of every one of them and the previous item that this is happening with because we really need that info to track down what the issue is. It is very possible that there are a small number of patches with this same problem, but I would guess that it is the same small set of them every time.
The first step is to identify the patches that are causing this problem and remove them from your patching baselines while a solution is determined to prevent this from happening.
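One way to keep that record is to pull the command, exit code and action id out of the client log. The sketch below only handles the “Command succeeded (Exit Code=…)” line shape quoted earlier in this thread; the function name and the returned field names are my own, and the regular expression is an assumption based on those three lines rather than on any documented log format.

```python
import re

# Matches lines like the "Command succeeded (Exit Code=0) ..." entry quoted earlier.
COMMAND_DONE = re.compile(
    r"Command succeeded \(Exit Code=(?P<code>-?\d+)\) (?P<command>.+?) "
    r"\(group:(?P<group>\d+),action:(?P<action>\d+)\)"
)

def completed_commands(log_text):
    """Return one dict per finished command, in the order they appear in the log."""
    results = []
    for line in log_text.splitlines():
        match = COMMAND_DONE.search(line)
        if match:
            results.append({
                "group": int(match.group("group")),
                "action": int(match.group("action")),
                "exit_code": int(match.group("code")),
                "command": match.group("command"),
            })
    return results

# The three log lines quoted earlier in the thread.
log = r"""
Command started - waithidden __Download\office2003-KB3128043-FullFile-ENU.exe /q:a /r:n (group:309996,action:309999)
Command succeeded (Exit Code=0) waithidden __Download\office2003-KB3128043-FullFile-ENU.exe /q:a /r:n (group:309996,action:309999)
Command succeeded action may require restart "e0f2dff30127233ea60e67d3bdc04cc39e839958" (group:309996,action:309999)
"""

for entry in completed_commands(log):
    print(entry["group"], entry["action"], entry["exit_code"], entry["command"])
```

Run over the logs from every failing device, the last entry before each failure is the “previous item” we keep asking about, and a simple count of those entries would show whether it really is a small, repeating set of patches.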
Also, this is the folder path that it is trying to clear:
This is the folder whose contents we need to know when this happens. We also need to know which processes have handles or locks on this folder and on any of its contents.
Sorry to hear about your troubles. But if you’re not giving any better information in the PMR than is being provided in this thread, then I can understand why it’s taking so long. Best of luck to you, I’m un-following this thread now.
@JasonWalker, well, it’s been three weeks, so yes, it’s taking a bit too long for us. But thank you for your input and helping my colleagues, I really appreciate it!
It will take infinite time if you don’t provide enough information. It will take a lot less time if you provide us with a lot more info and answer as many of the questions that we have posed in this thread as you can.
We can provide more help with more info. We can’t provide more help without more info.
The problem was, that we didn’t really know where to look. So that’s why it took us time to give the correct information. Now we’ve sorted that out, and stated the following:
System Process (process id 4) holding the directory “BES Client\__BESData\__Global\__Download\Updates for Windows Applications” (in this case the folder “Updates for Windows Applications” is due to the deployment of a fixlet under that site) and preventing BESClient process to remove the directory as per flow. This behaviour is unexpected since the mentioned directory should be managed only by the process BESClient.
The reason for the 2003 office patch, is that we have legacy hardware/software and the 2003 viewers are still installed on the machine (business requirements). So that’s why we still need to patch the office 2003.
We gave that as an example, but in fact it’s different patches at different times that are giving the issue. So it’s not really possible to see which patch is the offending one. Besides that, we only have it on a specific 1500-2000 devices (which is 10% of our environment).
In the procmon trace we also see TMBMSRV, which is from Trend Micro Anti Virus. They want us to exclude the BigFix directories from AV, but that’s strange, since we are using CPM from BigFix and that exclusion is checked by default (and I didn’t uncheck the checkbox, so I’m 100% sure that they are excluded in the realtime.ini file).
The only thing I can imagine now is that some devices didn’t receive their settings correctly (those devices are only online 10 min/day, most of the time on mobile broadband).
Thank you for your patience and help!
edit: I’ve looked further into it. On 1 test device we uninstalled CPM and still had the problem afterwards, so we can rule out AV.