We have a “many relays, few clients” setup here too.
We have 40,000 clients spread across 7,000 remote sites, all managed from a central location. Each remote site has a BigFix relay installed. The WAN is a hub-and-spoke design, so the root server and top-level relays are all located in the “hub”.
Our method for avoiding “WAN waves” during gathers was to place all top-level relays behind a router that throttled all BigFix traffic. This worked great (for many years) until the Patches for Windows site was republished during the afternoon on 12/19.
A few minutes after our root server did that gather on the afternoon of 12/19, the networking team came screaming over about how BigFix had completely maxed out the WAN. This was confusing to us: we were in a production freeze, so there were no running actions in BigFix that would have caused this … an afternoon gather of a site wasn’t even on our minds. The networking team ended up having the firewall block all BigFix WAN traffic so we could end the pain while we kept researching.
Turns out 1,500 remote site relays were directly connected to the root server. We’re still unsure why, since both their primary and secondary top-level relays were available the entire time. With the root server not being throttled and 1,500 remote relays connected to it, whatever communication happens between them during a site gather brought down the WAN ridiculously quickly.
So now we’re looking at implementing the “fake root” setup. Staggering and scheduling what happens between relays when gathers happen would be really nice to have in our environment.
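To make the staggering idea concrete, here is a rough sketch of the kind of spreading we have in mind. This is not a BigFix feature or setting; the function names, the relay naming scheme, and the one-hour window are all made up for illustration. Each relay gets a stable delay derived from a hash of its name, so 7,000 relays would be spread across the window instead of all reacting at once.

```python
import hashlib


def stagger_offset_seconds(relay_name: str, window_seconds: int) -> int:
    """Map a relay name to a stable delay in [0, window_seconds).

    Hypothetical helper: hashing the name gives the same offset every
    time, so no central schedule has to be stored or distributed.
    """
    digest = hashlib.sha256(relay_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def peak_concurrent(relays, window_seconds: int, slot_seconds: int) -> int:
    """Count how many relays land in the busiest slot of the window."""
    slots = {}
    for name in relays:
        slot = stagger_offset_seconds(name, window_seconds) // slot_seconds
        slots[slot] = slots.get(slot, 0) + 1
    return max(slots.values())


# Assumed naming scheme for 7,000 remote site relays (illustrative only).
relays = [f"relay-{i:04d}" for i in range(7000)]

# Spread over a one-hour window, measured in one-minute slots.
print(peak_concurrent(relays, window_seconds=3600, slot_seconds=60))
```

The printed number is the size of the busiest one-minute slot, which is the figure to compare against what the WAN can actually absorb. Widening the window or shrinking the slot changes that trade-off.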