Mailbox Issue Troubleshooting

anon87818915 · August 31, 2016, 2:14pm

Hey folks!

After spending some time tracking down and resolving unusual Mailbox-related issues, I though it would be beneficial to the greater BigFix user community to share the knowledge gained. If you find this type of information useful and would like to see more, please leave your comments below and I may start doing regular postings on subjects I have encountered while supporting many of the largest BigFix deployments in existence.

First, lets start with a couple of properties which are invaluable to have globally enabled in your environment when troubleshooting anything Mailbox related.

Client Mailbox Version
Identify and return the most current version of the Mailbox site the Client has gathered and is using, or nothing in the event of no Mailbox site (this is a valid state in the event no Mailboxed action was taken against this Client)

if (exists site version lists of sites whose("mailboxsite" = name of it)) then (maxima of site version lists of sites whose("mailboxsite" = name of it)) else (nothing)

Relay Client Mailbox Versions
Extract and return a listing of all Client Mailbox sites cached on a Relay in the format: <ComputerID>|<MailboxSiteVersion>|<MailboxSiteTimestamp>

if (exists setting "_BESRelay_HTTPServer_ServerRootPath" of client and exists value of setting "_BESRelay_HTTPServer_ServerRootPath" of client and exists folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client) and exists folder "mailbox" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client) and exists file "Mailboxes.db" of folder "mailbox" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client)) then ((column 0 of it as string & "|" & hexadecimal integer (column 1 of it as string) as string & "|" & column 2 of it as string) of rows of statement ("SELECT COMPUTERS.ComputerID, (SUBSTR(HEX(COMPUTERS.Version), 9, 2) || SUBSTR(HEX(COMPUTERS.Version), 7, 2) || SUBSTR(HEX(COMPUTERS.Version), 5, 2) || SUBSTR(HEX(COMPUTERS.Version), 3, 2)) AS Version, DATETIME(COMPUTERS.Timestamp, 'unixepoch') AS Timestamp FROM COMPUTERS WHERE COMPUTERS.Timestamp > 0 ORDER BY COMPUTERS.ComputerID") of sqlite database of file "Mailboxes.db" of folder "mailbox" of folder (value of setting "_BESRelay_HTTPServer_ServerRootPath" of client)) else (nothing)

With these activated, you can begin to identify inconsistencies in your environment by finding Clients whose Mailbox Site versions deviate from the version their Relay has (indicates there might be a problem gathering and/or processing the Mailbox site), or between Relays (indicating there might be a problem with corruption of the Mailboxes.db file on the Relay or a problem gathering and/or processing the Mailbox sync requests). Additionally, using the following T-SQL query you can extract the Mailbox Site versions the Root Server believes is most current, and leverage this against the results of your two properties to identify additional out-of-sync issues in your environment.

SELECT [MC].[ComputerID], CAST([MC].[Version] AS INT) AS [MailboxVersion], [MC].[Timestamp] FROM [BFEnterprise].[dbo].[MAILBOX_COMPUTERS] AS [MC] WITH (NOLOCK)

Note: Whenever running queries directly against the Root Server Database with BigFix Server services active, try and leverage the WITH (NOLOCK) T-SQL option. This will allow the BigFix Server services to continue to operate unimpeded by your actions, otherwise it is possible to cause database locks causing the various service processes to stop until your queries are completed.

If analysis has determined there is something strange going on with a Relay’s Mailboxes.db file, this can usually be resolved by:

Stop the BESRelay service on the machine having the problem
Create a backup copy of the Mailboxes.db file, typically found at \Program Files (x86)\BigFix Enterprise\BES
Relay\wwwrootbes\mailbox\Mailboxes.db
Delete the Mailboxes.db file
Start the BESRelay service

When you operate a relatively large BigFix environment, you might start seeing in your Top Level Relay (TLR) log files lines similar to the following:

ProcessMailboxQueue (8524) - Error running task ProcessMailboxQueue: HTTP Error 28: Timeout was reached

and corresponding lines similar to the following in your Root Server log:

/mailbox-request (2076) - Uncaught exception in plugin mailbox-request with client <RELAY_NAME>: Socket Error: Windows Error 0x2745: An established connection was aborted by the software in your host machine.

This is indicating the Relay is unable to complete it’s Mailbox refresh (or build) process because the Root Server is taking too long to generate a complete response. Normally this process should complete relatively fast, however when the Relay is requesting the Mailbox site information for 20,000+ Clients and something is not operating properly on the Root Server, the Relay will trigger the timeout condition resulting in the entire process to abort. The Relay will re-try again in the future (as seen in production, roughly every 30 minutes), but if the situation causing the issue on the Root Server is not resolved, the Relay will never complete the Mailbox refresh (or build) process and as such, downstream Clients may not receive new Mailboxed actions. There are some cases where the Relay will request the Mailbox information for a smaller number of Clients (often because of a new Action being taken and the Relay getting to perform the process using a smaller subset of Clients), but what we have found in practice is it is highly inconsistent with no way of knowing which Actions will make it thru. Setting the following Client setting on the Relay having the problem to a value higher than the default (10 seconds), then restarting the BESRelay service might be enough to get the problem resolved:

_HTTPRequestSender_Connect_TimeoutSeconds

Note: This setting effects other operations of the Relay and leaving it defined at a value other than the default could cause other issues down the line. I would only suggest adjusting the setting while actively troubleshooting an issue, making only modest increases until the Relay is able to get past the timeout problem, then dropping it back down when done.

When there is a problem on the Root Server where a Mailbox file (e.g. Action XYZ.fxf which represents the actual action) is no longer found on the disk (or is otherwise inaccessible by the Root Server or corrupted), you will see an entry similar to the following in the Relay log file:

ProcessMailboxQueue (5260) - Error running task ProcessMailboxQueue: Unexpected HTTP response: 404 Not Found

This problem is actually worse than it might initially appear for 2 reasons. First, you might not even see these error messages in your Relay log because the timeout problem masks this issue (which may in fact be the root cause of the slow Root Server response). Second, the Root Server has NO WAY to recover nor even report on missing or damaged Mailbox files (yes, you read that correctly!) Also, because of the way the Relay performs it’s refresh (or build) process, when it encounters one of these missing files, it aborts the entire transaction and will not move past until the referenced file either becomes available again on the Root Server, or the Action which represents the file is stopped and re-taken in the Console.

To identify which file (Action) is causing the issue, you can use one of two methods:

Perform a packet capture (i.e. Wireshark) on the Relay, filtering results looking for any HTTP requests going to the Root Server with “mailbox/files” in the URI, resulting in a HTTP 404 response. Once you see the SHA1 name of the file being requested, you can then search the MAILBOX_FILES table in the database for that SHA1 entry and see what the FileID value is. This should be the same as an Action ID you can lookup in the console. You can use the following T-SQL command to locate the record:

SELECT [FileID] FROM [BFEnterprise].[dbo].[MAILBOX_FILES] WHERE CONVERT(VARCHAR(MAX), [SHA1], 2) = N'<DiscoveredSHA1FilenameFromPacketCapture>'

Perform a dump of the records stored in the MAILBOX_FILES table, then enumerate through the disk on the Root Server until you discover a referenced file (using the SHA1 value stored in the database as the filename) not present. You can use the following T-SQL command to generate the listing of files which are expected to exist in the BES Server\wwwrootbes\mailbox\files directory and the corresponding Action ID you can use to lookup in the console:

SELECT CONVERT(VARCHAR(MAX), [SHA1], 2) AS [Filename], [FileID] FROM [BFEnterprise].[dbo].[MAILBOX_FILES]

Once you have located the problem Action, from within the Console:

Stop the problem Action
Copy the Action to open the Take Action dialog window
Ensure the targeting and all other options are set as needed, then take the Action

Repeat this until all missing files referenced have been removed from the database and replaced with new files representing the Action.

Although not currently documented by IBM anywhere public, the limitation of the Root Server to re-create these Mailbox files is something that needs to be accommodated for in your risk mitigation plans. In addition to the handful of key files IBM already documents and your daily database backups, you should now also include creating backup copies of all files stored in the BES Root Server\wwwrootbes\mailbox\files directory. This is also a very important step to perform when migrating server hardware since failure to copy these files over to the new server will result in possible issues down the road if a Relay ever has to re-build it’s Mailboxes.db.

I have opened up a RFE with IBM to see about getting this capability (or at least a built-in capability to report on the issue) into the product. If you agree this is something fairly important to have, please head over to http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=93824 and vote this RFE up.

To make troubleshooting easier and to really grasp the entirety of the problem, we developed a stand-alone application which captures much of the above (and more) information by querying the data directly from the database, builds a complete relay path hierarchy for each Client, then performs analysis to discover and report on any anomalies detected. By doing this, we can now automate the discovery of problems before they become more wide-spread or impact operations (e.g. a scheduled action doesn’t actually run because Clients can’t gather their Mailbox) and focus our resources directly when and where needed. If there is any demand for the tool, we may release it publicly for all to use.

jgstew · August 31, 2016, 8:34pm

Very helpful post. I definitely have a need for this tool you mention, and I know others that could use it as well.

Very minor nitpick:

The IF/THEN/ELSE isn’t required. It would be exactly equivalent if you did just this:

maxima of site version lists of sites whose("mailboxsite" = name of it)

My attempt to create some properties related to this: https://bigfix.me/analysis/details/2998337

Nice tip!

I don’t know that I ever thought about this before, but it makes sense when I think about it. The mailbox files are encrypted for their destination client in such a way that only that client should be able to decrypt them. The root could keep track of which ones it has created and should still exist, and perhaps if they are valid, but it can’t recreate them.

CC: @cdh

anon87818915 · September 1, 2016, 12:06pm

Yes, you are correct. The results are equivalent (on non-really-broken 8.1+ Clients), but not the same. I do have a handful of systems that unfortunately do not produce the same result, and although the Client does get upgraded when we find them, I prefer to not have <error> results when feeding the data into external applications for analysis.

When I write example properties for folks who are learning relevance from example code, by explicitly stating that a nothing is returned, they are not surprised when they see nothing as a result. I was working with an individual learning relevance when that was written and they were getting confused as they were expecting maximum to return the largest number or zero. Since the Client in question had never successfully gathered it’s Mailbox site but the log file was showing Mailbox site version 0, getting nothing back was a surprise. By wrapping the code I was able to convey additional insight into the problem and they were able to quickly figure out what was happening.

What I would like to see is the Root Server store the result in the database somewhere before writing it to the disk the first time. That way it can just write it back to disk if needed again in the future. I’m sure they chose to not do it this way for a reason, but there has to be a better way of dealing with these issues.

JonL · November 7, 2016, 2:12pm

Very helpful MDMike. +1 to your RFE!

juliac · June 21, 2017, 9:15am

Hi @MDMike, Thanks for the post it help me to reslove oure problem and understan how Mailbox are working.
I didnt found RFE.
Like @jgstew said, we also would be happy to have tool you mention.

Julia

Dave_L · June 22, 2017, 6:30pm

hi Mike,
Excellent post, found it informative.
Any chance I could get a copy of the tool you have mentioned please?

Dave Langridge

anon87818915 · June 23, 2017, 10:55am

I have some cleaning up of the code I have to do first (add better error handling) however since there appears to be a request for it, I will go ahead and make it public.

fermt · June 23, 2017, 2:06pm

I’m also interested on the tool. Let us know when you share it!

BigFixNinja · June 29, 2017, 6:00pm

Other mailbox issues to be aware of:
Troubleshooting problematic BigFix mailbox actions
http://www-01.ibm.com/support/docview.wss?uid=swg22003084

taylorj · August 23, 2017, 5:23pm

I’m adding a comment here to make a correction to the DB query that should be ran for determining the missing mailbox files solution above where the http 404 not found message is found in the Relay log. The query should be:

SELECT CONVERT(VARCHAR(MAX), [SHA1], 2) AS [Filename], [FileID], [Name] FROM [BFEnterprise].[dbo].[MAILBOX_FILES] WHERE [IsDeleted] = 0

The [Name] field will contains Action ???.fxf and Multiple Action Group ???.fxf.

anon87818915 · August 23, 2017, 5:37pm

You beat me to posting the update!

Since I can’t figure out how to edit the initial post, please use the following T-SQL query as the correct replacement for the identification method 2.

SELECT CONVERT(VARCHAR(MAX), [SHA1], 2) AS [Filename], [FileID], [Name] FROM [BFEnterprise].[dbo].[MAILBOX_FILES] WHERE [IsDeleted] = 0