Windows patching of BigFix servers: best practices?

Hi all,

Recently we patched all our Windows-based BigFix servers (root, SQL and relays). Although we “survived”, some questions have arisen about how to go about it properly.

The relays we simply patched and rebooted, verifying operation afterwards. The clients seemed to have no issues with a relay being temporarily absent. Higher up the chain, even a Top Level Relay and the Root Server itself did not pose a problem; apparently even these can be rebooted without issues. After restarting we checked the downstream relays for any issues. None found…
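For anyone curious, the post-reboot verification can be scripted along these lines. A minimal Python sketch, assuming the default Windows service name BESRelay and the default BES port 52311; adjust both if your deployment differs:

```python
import socket
import subprocess
import sys

RELAY_PORT = 52311  # default BES port; adjust if your deployment differs

def service_running(name: str) -> bool:
    """Check a Windows service's state via 'sc query'."""
    result = subprocess.run(["sc", "query", name],
                            capture_output=True, text=True)
    return "RUNNING" in result.stdout

def relay_listening(host: str, port: int = RELAY_PORT) -> bool:
    """Check that the relay answers on its TCP port."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    healthy = service_running("BESRelay") and relay_listening(host)
    print("relay OK" if healthy else "relay NOT healthy")
    sys.exit(0 if healthy else 1)
```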

However, pulling the proverbial plug on the SQL server hosting the BigFix database was another story. Just before rebooting the SQL server we ran a script that stopped the BigFix services on the root server, stopping it in its tracks. After SQL was up and running again, another script restarted the services, resuming operation. This procedure seems to work, though the Top Level Relay sometimes needed a “kick” in the form of restarting the BESRelay and/or BESClient service.

During the process, however, we hit a little “bump”. At one point the SQL server was offline for more than a few minutes because of a rollback (don’t you just hate those?), causing the FillDB buffer on the Top Level Relay to grow to 50 percent. Still no worries at that point, but then it just “dumped” its contents and started over at 0 percent (?). So that data was apparently lost, though it should not be a problem (hopefully). Another thing: after everything was patched and running again, we found a gather status warning for the Top Level Relay in the Console health overview, even though inspection of the XML file on the server revealed no problems. This (thankfully) resolved itself after 30 minutes or so without any intervention.
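For reference, such a stop/start script could look roughly like the sketch below (rendered in Python for readability). The service names and the stop/start ordering are assumptions based on a typical root server install, not an official prescription; verify them in services.msc on your own root server:

```python
import subprocess
import sys

# BES service names as they typically appear on a Windows root server;
# verify these in services.msc before relying on the script.
BES_SERVICES = [
    "BES Root Server",
    "BES FillDB",
    "BES GatherDB",
    "BES Web Reports Service",
]

def set_services(action: str) -> None:
    """Run 'net stop' or 'net start' for each BES service.

    Stopped top-down, started in reverse order; this ordering is an
    assumption, not an official prescription.
    """
    names = BES_SERVICES if action == "stop" else list(reversed(BES_SERVICES))
    for name in names:
        subprocess.run(["net", action, name], check=False)

if __name__ == "__main__":
    # run with "stop" just before the SQL reboot, "start" once SQL is back
    set_services(sys.argv[1] if len(sys.argv) > 1 else "stop")
```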

So after all this we just wondered: was the approach we took flawed? Is there any best practice or typical procedure for going through a patching cycle? Any input or experiences appreciated!

Richard,

The data your relay “dumped”: are you sure that data was really lost? The relay should hold onto that data until it can put it in the database. If the database doesn’t come up, eventually the relay stops taking reports and pushes back on the clients sending them, because the FillDB queue is full, until it can clear the backlog. It should never “dump” anything, as far as I know.
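If you want to watch that backlog directly instead of relying on the console health page, a simple directory-size check does the job. A rough sketch; both the buffer path and the size cap below are placeholders you would need to replace with your relay’s actual configured values:

```python
import os

# Placeholder values: point BUFFER_DIR at your relay's actual buffer
# directory and set MAX_BYTES to its configured cap.
BUFFER_DIR = r"C:\Program Files (x86)\BigFix Enterprise\BES Relay\BufferDir"
MAX_BYTES = 1 * 1024**3  # e.g. a 1 GB cap

def dir_size(path: str) -> int:
    """Total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # a file may vanish mid-walk; skip it
    return total

if __name__ == "__main__":
    used = dir_size(BUFFER_DIR)
    print(f"buffer at {100 * used / MAX_BYTES:.1f}% "
          f"({used} of {MAX_BYTES} bytes)")
```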

Most of the issues I come across when patching our infrastructure are with Web Reports not having immediate access to the SQL back-end and producing an error that will stop scheduled reports from being sent if it’s not cleared. Other than that, we don’t normally notice the SQL back-end being down unless we are logged into the console and see those errors.
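If you script your patch window, one option is to hold the Web Reports restart until the SQL back-end actually answers again. A rough sketch; the SQL host, port, and service name are placeholders for your own values, and the scheduled-reports error, once raised, still has to be cleared:

```python
import socket
import subprocess
import time

# Placeholders: your SQL host/port and the usual Web Reports service
# name; confirm both in your own environment.
SQL_HOST, SQL_PORT = "sqlserver01", 1433
SERVICE = "BES Web Reports Service"

def wait_for_sql(timeout: int = 600) -> bool:
    """Poll the SQL TCP port until it accepts connections or we give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((SQL_HOST, SQL_PORT), timeout=5):
                return True
        except OSError:
            time.sleep(10)
    return False

if __name__ == "__main__":
    if wait_for_sql():
        # bounce Web Reports so it reconnects cleanly to the back-end
        subprocess.run(["net", "stop", SERVICE], check=False)
        subprocess.run(["net", "start", SERVICE], check=False)
```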


Best practices for upgrades are here on the BigFix wiki, including approaches to reduce the risk of your FillDB getting backlogged.


Regarding the “dumped” data: the root server and SQL DB were both offline when we witnessed the FillDB on the Top Level Relay getting reset from 50 back to 0 percent. No idea where the data could have gone. From what we understood, it should indeed build a backlog until it reaches 100% and then start rejecting new incoming data (thus “filling up” other relays further down the chain). At least that would have made sense.