Hey folks!
After spending some time tracking down and resolving unusual Mailbox-related issues, I though it would be beneficial to the greater BigFix user community to share the knowledge gained. If you find this type of information useful and would like to see more, please leave your comments below and I may start doing regular postings on subjects I have encountered while supporting many of the largest BigFix deployments in existence.
First, lets start with a couple of properties which are invaluable to have globally enabled in your environment when troubleshooting anything Mailbox related.
Client Mailbox Version
Identify and return the most current version of the Mailbox site the Client has gathered and is using, or nothing in the event of no Mailbox site (this is a valid state in the event no Mailboxed action was taken against this Client)
if (exists site version lists of sites whose("mailboxsite" = name of it)) then (maxima of site version lists of sites whose("mailboxsite" = name of it)) else (nothing)
Relay Client Mailbox Versions
Extract and return a listing of all Client Mailbox sites cached on a Relay in the format: <ComputerID>|<MailboxSiteVersion>|<MailboxSiteTimestamp>
if (exists setting "_BESRelay_HTTPServer_ServerRootPath" of client and exists value of setting "_BESRelay_HTTPServer_ServerRootPath" of client and exists folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client) and exists folder "mailbox" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client) and exists file "Mailboxes.db" of folder "mailbox" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client)) then ((column 0 of it as string & "|" & hexadecimal integer (column 1 of it as string) as string & "|" & column 2 of it as string) of rows of statement ("SELECT COMPUTERS.ComputerID, (SUBSTR(HEX(COMPUTERS.Version), 9, 2) || SUBSTR(HEX(COMPUTERS.Version), 7, 2) || SUBSTR(HEX(COMPUTERS.Version), 5, 2) || SUBSTR(HEX(COMPUTERS.Version), 3, 2)) AS Version, DATETIME(COMPUTERS.Timestamp, 'unixepoch') AS Timestamp FROM COMPUTERS WHERE COMPUTERS.Timestamp > 0 ORDER BY COMPUTERS.ComputerID") of sqlite database of file "Mailboxes.db" of folder "mailbox" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client)) else (nothing)
With these activated, you can begin to identify inconsistencies in your environment by finding Clients whose Mailbox Site versions deviate from the version their Relay has (indicates there might be a problem gathering and/or processing the Mailbox site), or between Relays (indicating there might be a problem with corruption of the Mailboxes.db file on the Relay or a problem gathering and/or processing the Mailbox sync requests). Additionally, using the following T-SQL query you can extract the Mailbox Site versions the Root Server believes is most current, and leverage this against the results of your two properties to identify additional out-of-sync issues in your environment.
SELECT [MC].[ComputerID], CAST([MC].[Version] AS INT) AS [MailboxVersion], [MC].[Timestamp] FROM [BFEnterprise].[dbo].[MAILBOX_COMPUTERS] AS [MC] WITH (NOLOCK)
Note: Whenever running queries directly against the Root Server Database with BigFix Server services active, try and leverage the WITH (NOLOCK) T-SQL option. This will allow the BigFix Server services to continue to operate unimpeded by your actions, otherwise it is possible to cause database locks causing the various service processes to stop until your queries are completed.
If analysis has determined there is something strange going on with a Relay’s Mailboxes.db file, this can usually be resolved by:
- Stop the BESRelay service on the machine having the problem
- Create a backup copy of the Mailboxes.db file, typically found at \Program Files (x86)\BigFix Enterprise\BES
Relay\wwwrootbes\mailbox\Mailboxes.db - Delete the Mailboxes.db file
- Start the BESRelay service
When you operate a relatively large BigFix environment, you might start seeing in your Top Level Relay (TLR) log files lines similar to the following:
ProcessMailboxQueue (8524) - Error running task ProcessMailboxQueue: HTTP Error 28: Timeout was reached
and corresponding lines similar to the following in your Root Server log:
/mailbox-request (2076) - Uncaught exception in plugin mailbox-request with client <RELAY_NAME>: Socket Error: Windows Error 0x2745: An established connection was aborted by the software in your host machine.
This is indicating the Relay is unable to complete it’s Mailbox refresh (or build) process because the Root Server is taking too long to generate a complete response. Normally this process should complete relatively fast, however when the Relay is requesting the Mailbox site information for 20,000+ Clients and something is not operating properly on the Root Server, the Relay will trigger the timeout condition resulting in the entire process to abort. The Relay will re-try again in the future (as seen in production, roughly every 30 minutes), but if the situation causing the issue on the Root Server is not resolved, the Relay will never complete the Mailbox refresh (or build) process and as such, downstream Clients may not receive new Mailboxed actions. There are some cases where the Relay will request the Mailbox information for a smaller number of Clients (often because of a new Action being taken and the Relay getting to perform the process using a smaller subset of Clients), but what we have found in practice is it is highly inconsistent with no way of knowing which Actions will make it thru. Setting the following Client setting on the Relay having the problem to a value higher than the default (10 seconds), then restarting the BESRelay service might be enough to get the problem resolved:
_HTTPRequestSender_Connect_TimeoutSeconds
Note: This setting effects other operations of the Relay and leaving it defined at a value other than the default could cause other issues down the line. I would only suggest adjusting the setting while actively troubleshooting an issue, making only modest increases until the Relay is able to get past the timeout problem, then dropping it back down when done.
When there is a problem on the Root Server where a Mailbox file (e.g. Action XYZ.fxf which represents the actual action) is no longer found on the disk (or is otherwise inaccessible by the Root Server or corrupted), you will see an entry similar to the following in the Relay log file:
ProcessMailboxQueue (5260) - Error running task ProcessMailboxQueue: Unexpected HTTP response: 404 Not Found
This problem is actually worse than it might initially appear for 2 reasons. First, you might not even see these error messages in your Relay log because the timeout problem masks this issue (which may in fact be the root cause of the slow Root Server response). Second, the Root Server has NO WAY to recover nor even report on missing or damaged Mailbox files (yes, you read that correctly!) Also, because of the way the Relay performs it’s refresh (or build) process, when it encounters one of these missing files, it aborts the entire transaction and will not move past until the referenced file either becomes available again on the Root Server, or the Action which represents the file is stopped and re-taken in the Console.
To identify which file (Action) is causing the issue, you can use one of two methods:
- Perform a packet capture (i.e. Wireshark) on the Relay, filtering results looking for any HTTP requests going to the Root Server with “mailbox/files” in the URI, resulting in a HTTP 404 response. Once you see the SHA1 name of the file being requested, you can then search the MAILBOX_FILES table in the database for that SHA1 entry and see what the FileID value is. This should be the same as an Action ID you can lookup in the console. You can use the following T-SQL command to locate the record:
SELECT [FileID] FROM [BFEnterprise].[dbo].[MAILBOX_FILES] WHERE CONVERT(VARCHAR(MAX), [SHA1], 2) = N'<DiscoveredSHA1FilenameFromPacketCapture>'
- Perform a dump of the records stored in the MAILBOX_FILES table, then enumerate through the disk on the Root Server until you discover a referenced file (using the SHA1 value stored in the database as the filename) not present. You can use the following T-SQL command to generate the listing of files which are expected to exist in the BES Server\wwwrootbes\mailbox\files directory and the corresponding Action ID you can use to lookup in the console:
SELECT CONVERT(VARCHAR(MAX), [SHA1], 2) AS [Filename], [FileID] FROM [BFEnterprise].[dbo].[MAILBOX_FILES]
Once you have located the problem Action, from within the Console:
- Stop the problem Action
- Copy the Action to open the Take Action dialog window
- Ensure the targeting and all other options are set as needed, then take the Action
Repeat this until all missing files referenced have been removed from the database and replaced with new files representing the Action.
Although not currently documented by IBM anywhere public, the limitation of the Root Server to re-create these Mailbox files is something that needs to be accommodated for in your risk mitigation plans. In addition to the handful of key files IBM already documents and your daily database backups, you should now also include creating backup copies of all files stored in the BES Root Server\wwwrootbes\mailbox\files directory. This is also a very important step to perform when migrating server hardware since failure to copy these files over to the new server will result in possible issues down the road if a Relay ever has to re-build it’s Mailboxes.db.
I have opened up a RFE with IBM to see about getting this capability (or at least a built-in capability to report on the issue) into the product. If you agree this is something fairly important to have, please head over to http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=93824 and vote this RFE up.
To make troubleshooting easier and to really grasp the entirety of the problem, we developed a stand-alone application which captures much of the above (and more) information by querying the data directly from the database, builds a complete relay path hierarchy for each Client, then performs analysis to discover and report on any anomalies detected. By doing this, we can now automate the discovery of problems before they become more wide-spread or impact operations (e.g. a scheduled action doesn’t actually run because Clients can’t gather their Mailbox) and focus our resources directly when and where needed. If there is any demand for the tool, we may release it publicly for all to use.