I accidentally targeted `All Computers` when testing

ageorgiev · April 4, 2016, 10:36am

We had a major outage a couple of years ago because a CO operator did something like that, and it was a baseline that was removing computers from the domain and shutting them down. As a result we implemented several precautionary measures and limitations, and we also have an open RFE (#35984) for a built-in peer-review system (IBM at the time promised us that it would be done soon, but with the emphasis on the WebUI rather than the console, it hasn’t materialised just yet). Anyway, here are the measures we put in place to safeguard against something like this happening:

1. Turned ON the Action Overview pop-up (Advanced System Option “gtsConfirmAction” = true), so when a user submits an action they get another chance to change their mind.
2. Turned OFF the ability for CO operators to dynamically target machines (Advanced System Option “disableNmoDynamicTargeting” = true). There are already ways to limit the number of machines that can be targeted by selecting them manually or pasting a list, but nothing controls dynamic targeting, so we just disabled it for CO operators altogether.
3. Built our own “workaround” peer-review system and applied it to selected “high-risk” tasks/fixlets/baselines. The system uses the Action Settings Locks on the “Run only when” criteria, so when a user runs a high-risk action it goes into the “Constrained” status, and a second task then has to be run to “approve” the initial one (the “approve” task just sets a client setting that matches the criteria from the original task). We further restricted the “approve” task to a limited set of higher-level operators (separate roles were set up), so not everybody can review/approve actions. Hopefully we will get a fully functional built-in peer-review system soon, because this is not ideal, but it does the job for now.
4. Removed “write” and “create content” permissions on all custom sites for all CO operators and forced everybody to use a Development environment, with a handful of machines available in it, for all development and testing, so untested and unproven content is not run on production machines.
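For anyone trying to picture how the approval workaround behaves, here is a rough Python model of the logic only. It is not BigFix code and does not use any BigFix API: the setting name `PeerReviewApproved_<id>`, the `Client` and `Operator` classes, and the `can_approve` flag are all made up for illustration, standing in for the client setting, the “Run only when” lock, and the separate approver role described above.

```python
from dataclasses import dataclass, field


def approval_key(action_id: int) -> str:
    # Hypothetical naming scheme for the client setting the gate checks.
    return f"PeerReviewApproved_{action_id}"


@dataclass
class Client:
    name: str
    settings: dict = field(default_factory=dict)  # stands in for client settings


@dataclass
class Operator:
    name: str
    can_approve: bool = False  # stands in for membership in the approver role


def action_status(client: Client, action_id: int) -> str:
    # Models the locked "Run only when" criteria: the high-risk action stays
    # constrained until the matching client setting is present.
    approved = client.settings.get(approval_key(action_id)) == "true"
    return "Running" if approved else "Constrained"


def approve(operator: Operator, clients: list, action_id: int) -> None:
    # Models the "approve" task: it only sets the client setting, and only
    # operators holding the approver role are allowed to run it.
    if not operator.can_approve:
        raise PermissionError(f"{operator.name} is not allowed to approve actions")
    for client in clients:
        client.settings[approval_key(action_id)] = "true"


if __name__ == "__main__":
    targets = [Client("pc-001"), Client("pc-002")]
    print([action_status(c, 42) for c in targets])   # before approval
    approve(Operator("senior-op", can_approve=True), targets, 42)
    print([action_status(c, 42) for c in targets])   # after approval
```

The point of the design is that the high-risk action itself never changes: the second person only flips a setting, so the original action and its targeting stay exactly as the first operator submitted them.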

With these four in place, we are confident that at least two people review each high-risk action, and even if both make a mistake and an erroneous action is run, it cannot have a big impact (no CO operator can run an action on more than 200 machines). Maybe some or all of these will be of use to you.