Hi all,
Recently we patched all our Windows-based BigFix servers (root, SQL and relays). Although we “survived”, some questions about how to go about it properly have arisen.
For the relays, we simply patched, rebooted and verified operation afterwards. The clients seemed to have no issues with a relay being temporarily absent. Higher up, neither a Top Level Relay nor the Root Server itself posed a problem; apparently even these can be rebooted without issues. After restarting we checked the downstream relays for any issues. None found…
However, pulling the proverbial plug on the SQL server hosting the BigFix database was another story. Just before rebooting the SQL server we ran a script that stopped the BigFix services on the root server, halting it in its tracks. Once SQL was up and running again, a second script restarted the services and operation resumed. This procedure seems to work, though the Top Level Relay sometimes needed a “kick” in the form of a restart of the BESRelay and/or BESClient service.
During the process, however, we hit a little “bump”. At one point the SQL server was offline for more than a few minutes because of a roll-back (don’t you just hate that?), which made the FillDB size on the Top Level Relay grow to 50 percent. Still no worries at that point, but then it just “dumped” its contents and started over at 0 percent (?). So that data was lost, but hopefully that is not a problem.
Another thing: after everything was patched and running again, we found a gather status warning for the Top Level Relay in the Console health overview, even though inspection of the XML file on the server revealed no problems. Thankfully, this resolved itself after 30 minutes or so without any intervention.
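For anyone who wants to compare notes on the stop/start step around the SQL reboot, here is a minimal Python sketch of that kind of wrapper. It is not our actual script. The service names in ROOT_SERVICES are assumptions (check yours with `sc query` on the root server), and starting in the reverse of the stop order is just one plausible choice, not a documented requirement.

```python
import subprocess

# Assumed Windows service names on the root server; verify before use.
ROOT_SERVICES = ["BESRootServer", "FillDB", "GatherDB"]


def plan(action, services=ROOT_SERVICES):
    """Build the `net stop` / `net start` commands without running them.

    Services are stopped in the listed order and started in reverse order.
    """
    if action not in ("stop", "start"):
        raise ValueError("action must be 'stop' or 'start'")
    ordered = list(services) if action == "stop" else list(reversed(services))
    return [["net", action, name] for name in ordered]


def run(commands, runner=subprocess.run):
    """Run each command in turn and return (command, exit code) pairs.

    `net stop` exits non-zero when a service is already stopped, so a
    failure is recorded rather than raised.
    """
    results = []
    for cmd in commands:
        completed = runner(cmd, check=False)
        results.append((cmd, completed.returncode))
    return results


# Typical use, from an elevated prompt on the root server:
#   run(plan("stop"))    # just before rebooting the SQL server
#   run(plan("start"))   # once SQL is accepting connections again
```

`net stop` waits for the service to finish stopping before it returns, which is why the sketch uses it rather than `sc stop`. Keeping `plan` separate from `run` makes it easy to print the commands and review them before touching a production server.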
So after all this we just wondered: was the approach we took flawed? Is there any best practice or typical procedure for going through a patching cycle? Any input or experiences appreciated!